US-12619625-B2 - System for performing data transformations using a set of independent software components
Abstract
Described is a system (and method) that provides a framework for performing data transformations, which may be part of an Extract, Transform, and Load (ETL) process. The system may perform a data transformation by creating a pipeline that executes a set of independent software components (or plugins, add-ons, etc.). The components may be executed as individual services (e.g., microservices) that may be provided within containers to allow the components to be deployed as self-contained units on various types of host systems including cloud-based infrastructures. In addition, to provide further flexibility for the framework, the components may be implemented using preexisting software libraries.
Inventors
- Meher Ram Janyavula
- Arunkumar Rajappan
- Muralikumar Venkatasubramaniam
Assignees
- CAPITAL ONE SERVICES, LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2019-12-13
Claims (20)
- 1 . A method comprising: receiving, by at least one processor, a data retrieval request specifying: at least one source database from which to retrieve account data and at least one target database to which to load the account data; identifying, by the at least one processor, at least one source data type associated with the account data in the at least one source database; identifying, by the at least one processor, at least one target data type associated with the at least one target database; determining, by the at least one processor, a particular transformation graph from a library of transformation graphs associated with the at least one source data type and the at least one target data type, the particular transformation comprising a plurality of modular operation-specific components representing a plurality of modular transformation operations; wherein the particular transformation is configured to transform the account data of the at least one source data type to the at least one target data type; wherein the plurality of modular operation-specific components are configured to be deployable across a plurality of different types of database infrastructures; wherein each modular operation-specific component of the plurality of modular operation-specific components comprises a distinct software package configured to perform at least one transformation operation of the plurality of transformation operations; generating, by the at least one processor, a transformation workflow associated with the data retrieval request based at least in part on the plurality of modular operation-specific components; wherein the transformation workflow specifies a sequence of the plurality of modular operation-specific components to perform the plurality of transformation operations so as to transform the account data from at least one source data format to the at least one target data format; determining, by the at least one processor, execution scheduling for each modular 
operation-specific component of the plurality of modular operation-specific components based at least in part on the transformation workflow; requesting, by the at least one processor, a cloud-based host system to instantiate a plurality of containers for the plurality of modular operation-specific components, wherein each respective container is configured to establish a separate execution environment with a separate processing thread for each respective modular operation-specific component; receiving, by the at least one processor, at least one modification to the plurality of modular operation-specific components in the transformation workflow; modifying, by the at least one processor, at least on modular operation-specific component of the plurality of modular operation specific components while executing the transformation workflow, wherein executing the transformation workflow comprising, at each transformation operation of the plurality of transformation operations in the executing scheduling, invoking, via the cloud-based host system, each respective container to execute each respective modular operation-specific component of the plurality of modular operation-specific components with the at least one modification to cause at least one processing resource to transform the account data of a data table structure as part of the plurality of transformation operations performed by the plurality of modular operation-specific components, wherein the plurality of transformation operations comprises: identifying updated data based upon a comparison of a first subset of the account data associated with a current day and with a second subset of the account data associated with a previous day; combining the updated data with new data corresponding to at least one new account to produce combined data; and joining the combined data with a copy of the account data of a second table data structure to produce joined data.
- 2 . The method of claim 1 , wherein an execution of the sequence of the plurality of modular operation-specific components is performed as a data pipeline.
- 3 . The method of claim 2 , further comprising: utilizing, by the at least one processor, at least one transformation component to perform the sequence; wherein each modular operation-specific component is implemented using a python data analysis library, and the table data structure includes a DataFrame object of the python data analysis library.
- 4 . The method of claim 3 , wherein the at least one transformation component is implemented as an independent microservice or service.
- 5 . The method of claim 4 , wherein each microservice or service is deployed within a container on a host that is remote from the at least one transformation component, and the execution of the at least one transformation component is initiated by an instruction to the at least one service using a communication protocol.
- 6 . The method of claim 1 , wherein transforming the account data includes the sequence of the plurality of modular operation-specific components comprises: deduplicating, by a deduplication component, the joined data to remove duplicate records of the joined data associated with the current day and the previous day.
- 7 . The method of claim 1 , wherein the sequence of the plurality of modular operation-specific components comprises: filtering the account data of the table data structure using a filter component to produce filtered data and rejected data; and writing the rejected data to a reject file.
- 8 . The method of claim 7 , wherein the sequence of the plurality of modular operation-specific components comprises: reformatting the account data of the table data structure using a first reformat component to produce first reformatted data; retrieving data from one or more columns of an input file using the filtered data and the first reformatted data; and extracting a set of fields from the data in the input file.
- 9 . The method of claim 8 , wherein the sequence of the plurality of modular operation-specific components further comprises: reformatting the account data of the second data table structure using a second reformat component to produce second reformatted data; replicating the second reformatted data using a replicate component to produce first and second copies of the second reformatted data; and joining the first reformatted data and the first copy of the second reformatted data using a first join component to produce the joined data, wherein the updated data is based upon a comparison of the account data associated with the current day in the first reformatted data and the account data associated with the previous day in the first copy of the second reformatted data.
- 10 . The method of claim 9 , wherein joining the combined data with a copy of the account data of the table data structure to produce the joined data comprises joining the combined data and the second copy of the second reformatted data using a second join component to produce second joined data.
- 11 . A system, comprising: one or more processors; and a memory coupled to the one or more processors, the memory storing instructions, which when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving a data retrieval request specifying: at least one source database from which to retrieve account data and at least one target database to which to load the account data; identifying at least one source data type associated with the account data in the at least one source database; identifying at least one target data type associated with the at least one target database; determining a particular transformation graph from a library of transformation graphs associated with the at least one source data type and the at least one target data type, the particular transformation comprising a plurality of modular operation-specific components representing a plurality of modular transformation operations; wherein the particular transformation is configured to transform the account data of the at least one source data type to the at least one target data type; wherein the plurality of modular operation-specific components are configured to be deployable across a plurality of different types of database infrastructures; wherein each modular operation-specific component of the plurality of modular operation-specific components comprises a distinct software package configured to perform at least one transformation operation of the plurality of transformation operations; generating a transformation workflow associated with the data retrieval request based at least in part on the plurality of modular operation-specific components; wherein the transformation workflow specifies a sequence of the plurality of modular operation-specific components to perform the plurality of transformation operations so as to transform the account data from at least one source data format to the at least one target data format; determining 
execution scheduling for each modular operation-specific component of the plurality of modular operation-specific components based at least in part on the transformation workflow; requesting a cloud-based host system to instantiate a plurality of containers for the plurality of modular operation-specific components, wherein each respective container is configured to establish a separate execution environment with a separate processing thread for each respective modular operation-specific component; receiving at least one modification to the plurality of modular operation-specific components in the transformation workflow; modifying at least on modular operation-specific component of the plurality of modular operation specific components while executing the transformation workflow, wherein executing the transformation workflow comprising, at each transformation operation of the plurality of transformation operations in the executing scheduling, invoking, via the cloud-based host system, each respective container to execute each respective modular operation-specific component of the plurality of modular operation-specific components with the at least one modification to cause at least one processing resource to transform the account data of a data table structure as part of the plurality of transformation operations performed by the plurality of modular operation-specific components, wherein the plurality of transformation operations comprises: identifying updated data based upon a comparison of a first subset of the account data associated with a current day and with a second subset of the account data associated with a previous day; combining the updated data with new data corresponding to at least one new account to produce combined data; and joining the combined data with a copy of the account data of a second table data structure to produce joined data.
- 12 . The system of claim 11 , wherein an execution of the sequence of the plurality of modular operation-specific components is performed as a data pipeline.
- 13 . The system of claim 12 , further comprising: utilizing, by the at least one processor, at least one transformation component to perform the sequence; wherein each modular operation-specific component is implemented using a python data analysis library, and the table data structure includes a DataFrame object of the python data analysis library.
- 14 . The system of claim 13 , wherein the at least one transformation component is implemented as an independent microservice or service.
- 15 . The system of claim 14 , wherein each microservice or service is deployed within a container on a host that is remote from the at least one transformation component, and the execution of the at least one transformation component is initiated by an instruction from the system using a communication protocol.
- 16 . The system of claim 11 , wherein transforming the data includes the sequence of the plurality of modular operation-specific components comprises: deduplicating, by a deduplication component, the joined data to remove duplicate records of the joined data associated with the current day and the previous day.
- 17 . The system of claim 11 , wherein the sequence of the plurality of modular operation-specific components comprises: filtering the account data of the first table data structure using a filter component to produce filtered data and rejected data; and writing the rejected data to a reject file.
- 18 . The system of claim 17 , wherein the sequence of the plurality of modular operation-specific components comprises: reformatting the account data of the table data structure using a first reformat component to produce first reformatted data; retrieving data from one or more columns of an input file using the filtered data and the first reformatted data; and extracting a set of fields from the data in the input file.
- 19 . The system of claim 18 , wherein the sequence of the plurality of modular operation-specific components further comprises: reformatting the account data of the second data table structure using a second reformat component to produce second reformatted data; replicating the second reformatted data using a replicate component to produce first and second copies of the second reformatted data; and joining the first reformatted data and the first copy of the second reformatted data using a first join component to produce the joined data, wherein the updated data is based upon a comparison of the account data associated with the current day in the first reformatted data and the account data associated with the previous day in the first copy of the second reformatted data.
- 20 . A non-transitory computer-readable medium storing instructions which, when executed by one or more processors of a system, cause the system to perform operations comprising: receiving a data retrieval request specifying: at least one source database from which to retrieve account data and at least one target database to which to load the account data; identifying at least one source data type associated with the account data in the at least one source database; identifying at least one target data type associated with the at least one target database; determining a particular transformation graph from a library of transformation graphs associated with the at least one source data type and the at least one target data type, the particular transformation graph comprising a plurality of modular operation-specific components representing a plurality of modular transformation operations; wherein the particular transformation is configured to transform the account data of the at least one source data type to the at least one target data type; wherein the plurality of modular operation-specific components are configured to be deployable across a plurality of different types of database infrastructures; wherein each modular operation-specific component of the plurality of modular operation-specific components comprises a distinct software package configured to perform at least one transformation operation of the plurality of transformation operations; generating a transformation workflow associated with the data retrieval request based at least in part on the plurality of modular operation-specific components; wherein the transformation workflow specifies a sequence of the plurality of modular operation-specific components to perform the plurality of transformation operations so as to transform the account data from at least one source data format to the at least one target data format; determine execution scheduling for each modular operation-specific component of 
the plurality of modular operation-specific components based at least in part on the transformation workflow; request a cloud-based host system to instantiate a plurality of containers for the plurality of modular operation-specific components, wherein each respective container is configured to establish a separate execution environment with a separate processing thread for each respective modular operation-specific component; receiving at least one modification to the plurality of modular operation-specific components in the transformation workflow; modifying at least on modular operation-specific component of the plurality of modular operation specific components while executing the transformation workflow, wherein executing the transformation workflow comprising, at each transformation operation of the plurality of transformation operations in the executing scheduling, invoking, via the cloud-based host system, each respective container to execute each respective modular operation-specific component of the plurality of modular operation-specific components with the at least one modification to cause at least one processing resource to transform the account data of a data table structure as part of the plurality of transformation operations performed by the plurality of modular operation-specific components, wherein the plurality of transformation operations comprises: identifying updated data based upon a comparison of a first subset of the account data associated with a current day and with a second subset of the account data associated with a previous day; combining the updated data with new data corresponding to at least one new account to produce combined data; and joining the combined data with a copy of the account data of a second table data structure to produce joined data.
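The three transformation operations recited at the end of the independent claims (identifying updated data, combining it with new-account data, and joining the result with a copy of a second table) can be illustrated with a minimal sketch. This is not the claimed implementation: it uses plain Python lists of dictionaries in place of a DataFrame, and the key name `account_id`, the function names, the choice of an inner join, and the sample rows are all assumptions made for illustration only.

```python
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]

KEY = "account_id"  # hypothetical comparison/join key, not taken from the claims


def identify_updated(current: Table, previous: Table) -> Table:
    """Current-day rows whose key existed the previous day but whose values changed."""
    previous_by_key = {row[KEY]: row for row in previous}
    return [
        row for row in current
        if row[KEY] in previous_by_key and row != previous_by_key[row[KEY]]
    ]


def identify_new(current: Table, previous: Table) -> Table:
    """Current-day rows whose key did not exist the previous day."""
    previous_keys = {row[KEY] for row in previous}
    return [row for row in current if row[KEY] not in previous_keys]


def combine(updated: Table, new_rows: Table) -> Table:
    """Concatenate updated rows and new-account rows."""
    return updated + new_rows


def replicate(table: Table) -> Table:
    """Return an independent copy of a table for a downstream consumer."""
    return [dict(row) for row in table]


def join(left: Table, right: Table) -> Table:
    """Inner join on KEY (join type is an assumption); left fields win on name clashes."""
    right_by_key = {row[KEY]: row for row in right}
    return [
        {**right_by_key[row[KEY]], **row}
        for row in left if row[KEY] in right_by_key
    ]


# Hypothetical sample data: account 2 changed, account 3 is new.
previous_day = [{"account_id": 1, "balance": 100}, {"account_id": 2, "balance": 50}]
current_day = [
    {"account_id": 1, "balance": 100},
    {"account_id": 2, "balance": 75},
    {"account_id": 3, "balance": 10},
]
second_table = [{"account_id": 2, "owner": "A"}, {"account_id": 3, "owner": "B"}]

# Apply the operations in sequence, as a workflow would.
updated = identify_updated(current_day, previous_day)
new_rows = identify_new(current_day, previous_day)
combined = combine(updated, new_rows)
joined = join(combined, replicate(second_table))
```

Each function stands in for one operation-specific component. In the arrangement the claims describe, each component would instead run in its own container and be invoked according to the execution scheduling, which this single-process sketch does not model.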
Description
RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 16/109,106 filed on Aug. 22, 2018. The contents of the above application are incorporated by reference as if fully set forth herein in their entirety.
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the United States Patent and Trademark Office patent file or records but otherwise reserves all copyright rights whatsoever.
TECHNICAL FIELD
This disclosure relates to a system that performs data transformations, and more particularly, a system that performs data transformations using a set of independent software components.
BACKGROUND
When managing database systems, it is often necessary to transfer data between heterogeneous systems. To facilitate such a process, various tools have been developed to extract, transform, and load (ETL) data from a source database to a target database. These tools, however, are often specialized and configured for particular proprietary database systems. Accordingly, developers often require specific knowledge of the proprietary systems, which limits the ability to update and scale such systems. Moreover, such ETL tools often impose particular hardware requirements with minimal configurability. As a result, currently available ETL tools are not well-suited for current development architectures that require customization and scalability. Accordingly, there is a need for a framework that provides ETL tools that are flexible and deployable across various types of infrastructures.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate implementations of the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a block diagram illustrating an example operating environment for performing data transformations using a set of components according to one or more implementations of the disclosure.
FIG. 2 is a block diagram illustrating an example set of components for performing data transformations according to one or more implementations of the disclosure.
FIG. 3 is a process flow diagram illustrating an example of a process for performing data transformations according to one or more implementations of the disclosure.
FIG. 4 is a process flow diagram illustrating an example of a process for executing a set of components according to one or more implementations of the disclosure.
FIG. 5 is a diagram illustrating an example of various transformation operations that may be performed by a set of transformation components according to one or more implementations of the disclosure.
FIG. 6 is a diagram illustrating an example transformation graph of a set of components representing one or more sequences of operations to perform data transformations according to one or more implementations of the disclosure.
FIG. 7 is a flow diagram illustrating an example of a method for performing data transformations according to one or more implementations of the disclosure.
FIG. 8 is a block diagram illustrating an example of a computing system that may be used in conjunction with one or more implementations of the disclosure.
DETAILED DESCRIPTION
Various implementations and aspects of the disclosures will be described with reference to details discussed below, and the accompanying drawings will illustrate the various implementations.
The following description and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various implementations of the present disclosure. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of implementations of the present disclosure.
Reference in the specification to “one implementation,” “an implementation,” or “some implementations” means that a particular feature, structure, or characteristic described in conjunction with the implementation can be included in at least one implementation of the disclosure. The appearances of the phrase “implementation” in various places in the specification do not necessarily all refer to the same implementation.
In some implementations, described is a system (and method) that provides a framework for performing data transformations, which may be part of an Extract, Transform, and Load (ETL) process. The system may perform data transformations by creating and executing one or more sequences of operations using a preconfigured set of independent software components (or plugins). In one implementation, such components may be executed as individual services (e.g., microservices) that may be provided w