
US-20260127003-A1 - DEFINING AND INCORPORATING REUSABLE AGGREGATE COMPONENTS IN DATA PIPELINES

US 20260127003 A1

Abstract

A data pipeline orchestration tool facilitates the creation of reusable aggregate components that can be created during building of new data pipelines or integrated into existing data pipelines, and manages the placement and connection of aggregate components within an existing pipeline. The aggregate components utilize virtual components as connectors to other data pipeline components, which include a virtual source and/or a virtual sink. The virtual source maps a specific schema of a source of data to be processed by the aggregate component, such as specific fields of a database table, to a generic schema of the virtual source. The virtual sink designates the flow of data from the aggregate component to one or more downstream components of the data pipeline. Aggregate components can also be stored in a component library or other component storage offered by the data pipeline orchestration tool to facilitate reuse of aggregate components across data pipelines.
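As a rough illustration of the virtual-source idea described above (a hypothetical sketch, not the tool's actual API; all names here are invented), the mapping from a concrete source schema to the generic schema of a virtual source can be modeled as a simple binding:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualSource:
    """Hypothetical sketch of a virtual source: a generic schema plus a
    binding from concrete source columns to each generic field."""
    expected_fields: tuple          # generic schema of the virtual source
    bindings: dict = field(default_factory=dict)

    def bind(self, source_columns):
        # The number of concrete columns must match the generic schema.
        if len(source_columns) != len(self.expected_fields):
            raise ValueError("column count must match the virtual source schema")
        self.bindings = dict(zip(self.expected_fields, source_columns))
        return self.bindings

# Example: map two columns of a hypothetical "orders" table onto a
# generic two-field schema expected by an aggregate component.
vs = VirtualSource(expected_fields=("value", "timestamp"))
mapping = vs.bind(["amt", "ts"])
```

The binding is what lets the same aggregate component be dropped into pipelines whose sources use different column names.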

Inventors

  • Ian Clive Funnell
  • Shane Paul Darren Booth

Assignees

  • Matillion Limited

Dates

Publication Date
2026-05-07
Application Date
2024-11-01

Claims (20)

  1. A method comprising: defining an aggregate component that is reusable across extract, transform, load (ETL) data pipelines or extract, load, transform (ELT) data pipelines, wherein the aggregate component comprises a plurality of data pipeline components, wherein defining the aggregate component comprises, detecting selection of at least a first data pipeline component and second data pipeline component for inclusion in the aggregate component; creating a first virtual component and connecting an output of the first virtual component to an input of the first data pipeline component; configuring a schema of the first virtual component, wherein the schema of the first virtual component indicates a number of expected data fields from which to obtain data as input to the aggregate component; and creating a second virtual component and connecting an output of the second data pipeline component to an input of the second virtual component, wherein the second virtual component binds an output of the second data pipeline component to a downstream data pipeline component; and storing the aggregate component for incorporation into data pipelines.
  2. The method of claim 1, further comprising inserting the aggregate component into a first data pipeline, wherein the first data pipeline comprises one or more data pipeline components.
  3. The method of claim 2, further comprising obtaining a configuration of the aggregate component based on insertion of the aggregate component into the first data pipeline, wherein the configuration of the aggregate component comprises an indication of a data source from which to source data for input to the first data pipeline component and one or more data fields of the data source, wherein a number of the one or more data fields indicated in the configuration matches the number of expected data fields indicated in the schema of the first virtual component, and wherein inserting the aggregate component into the first data pipeline comprises mapping the one or more data fields of the data source to respective ones of the expected data fields.
  4. The method of claim 3, wherein the data source corresponds to a different data pipeline component of the one or more data pipeline components, wherein the different data pipeline component is upstream of the aggregate component in the first data pipeline.
  5. The method of claim 3, wherein the number of data fields indicated in the configuration is all data fields of the data source and the configuration of the aggregate component indicates all data fields of the data source, and wherein inserting the aggregate component into the first data pipeline comprises generating a query to retrieve data from each data field of the data source as input to the first data pipeline component.
  6. The method of claim 3, wherein the number of expected data fields indicated in the configuration is one or more and the configuration of the aggregate component indicates one or more data fields of the data source, wherein the one or more data fields of the data source are a subset of all data fields of the data source, and wherein inserting the aggregate component into the first data pipeline comprises generating a query that binds each of the one or more data fields of the data source to a respective alias and retrieves data from each of the one or more data fields as input to the first data pipeline component.
  7. The method of claim 3, wherein inserting the aggregate component into the first data pipeline comprises connecting the second virtual component to a different data pipeline component of the one or more data pipeline components, wherein the different data pipeline component is downstream of the aggregate component in the first data pipeline, and wherein connecting the second virtual component to the different data pipeline component comprises binding an output of the aggregate component to an input of the different data pipeline component via one or more aliases generated by the second virtual component.
  8. The method of claim 1, wherein detecting selection of the first and second data pipeline components is based on detecting at least a first input into a graphical user interface (GUI) comprising the selection of the first and second data pipeline components.
  9. The method of claim 1, wherein defining the aggregate component further comprises configuring the aggregate component with a default data source indicated in an obtained configuration of the aggregate component, wherein the obtained configuration indicates the default data source and one or more data fields of the default data source, and wherein a number of the one or more data fields of the default data source matches the number of expected data fields.
  10. The method of claim 1, wherein storing the aggregate component comprises storing the aggregate component in a library of data pipeline components for building data pipelines.
  11. One or more non-transitory machine-readable media having program code stored thereon, the program code comprising instructions to: detect selection of a plurality of data pipeline components for inclusion in an aggregate data pipeline component, wherein the plurality of data pipeline components comprises an upstream data pipeline component and a downstream data pipeline component; create a virtual source component of the aggregate data pipeline component, wherein the instructions to create the virtual source component comprise instructions to, instantiate the virtual source component; connect an output of the virtual source component to an input of the upstream data pipeline component; and generate a configuration of the virtual source component based on configuration of a schema of the virtual source component, wherein the configuration of the virtual source component indicates an expected number of columns from which to obtain data as input to the virtual source component; create a virtual sink component of the aggregate data pipeline component, wherein the instructions to create the virtual sink component comprise instructions to, instantiate the virtual sink component; and connect an output of the downstream data pipeline component to an input of the virtual sink component; and store the aggregate data pipeline component for incorporation into one or more data pipelines, wherein the one or more data pipelines are extract, transform, load (ETL) data pipelines or extract, load, transform (ELT) data pipelines.
  12. The non-transitory machine-readable media of claim 11, wherein the program code further comprises instructions to insert the aggregate data pipeline component into a first data pipeline, wherein the instructions to insert the aggregate data pipeline component into the first data pipeline comprise instructions to, obtain a configuration of the aggregate data pipeline component indicating a table from which to retrieve data as input and one or more columns of the table, wherein a number of the one or more columns matches the expected number of columns; and for each of the one or more columns of the table, map the column to an indication of an expected column from which to obtain data as input to the virtual source component.
  13. The non-transitory machine-readable media of claim 12, wherein the instructions to insert the aggregate data pipeline component into the first data pipeline comprise instructions to, based on a determination that the configuration of the aggregate data pipeline component indicates all columns of the table, generate a query to retrieve data from each column of the table; and based on a determination that the configuration of the aggregate data pipeline component indicates a subset of columns of the table, generate a query that binds each of the subset of columns to a respective alias and retrieves data from each of the subset of columns and retrieves data from others of the columns of the table without aliases.
  14. The non-transitory machine-readable media of claim 11, wherein the instructions to detect selection of the plurality of data pipeline components comprise instructions to detect at least a first input into a graphical user interface (GUI) comprising the selection of the plurality of data pipeline components.
  15. An apparatus comprising: a processor; and a machine-readable medium having instructions stored thereon that are executable by the processor to cause the apparatus to, create an aggregate component that is reusable across extract, transform, load (ETL) data pipelines or extract, load, transform (ELT) data pipelines, wherein the aggregate component comprises a plurality of components, wherein the instructions to create the aggregate component comprise instructions to, detect selection of a set of components for inclusion in the aggregate component, wherein the set of components comprises a first component and second component; create a first virtual component and connect an output of the first virtual component to an input of the first component; configure a schema of the first virtual component, wherein the schema of the first virtual component indicates a number of expected data fields from which to obtain input to the aggregate component; and create a second virtual component and connect an output of the second component to an input of the second virtual component; and incorporate the aggregate component into a first data pipeline.
  16. The apparatus of claim 15, wherein the first data pipeline comprises a plurality of components, wherein the instructions executable by the processor to cause the apparatus to detect selection of the set of components comprise instructions executable by the processor to cause the apparatus to detect selection of the set of components from the plurality of components, and wherein the instructions executable by the processor to cause the apparatus to incorporate the aggregate component into the first data pipeline comprise the instructions executable by the processor to cause the apparatus to replace the set of components in the first data pipeline with the aggregate component.
  17. The apparatus of claim 15, wherein the instructions executable by the processor to cause the apparatus to incorporate the aggregate component into the first data pipeline comprise instructions executable by the processor to cause the apparatus to, based on obtaining a configuration of the aggregate component that comprises an indication of a data source and one or more data fields of the data source, generate a first query to retrieve data from the one or more data fields of the data source; and map the one or more data fields of the data source to respective ones of the expected data fields, wherein a number of the one or more data fields matches the number of the expected data fields.
  18. The apparatus of claim 17, wherein the instructions executable by the processor to cause the apparatus to generate the first query comprise instructions executable by the processor to cause the apparatus to, based on a determination that the configuration of the aggregate component indicates all data fields of the data source, generate a query to retrieve data from each data field of the data source; and based on a determination that the configuration of the aggregate component indicates a subset of data fields of the data source, generate a query that binds each of the subset of data fields to a respective alias and retrieves data from each of the subset of data fields and retrieves data from others of the data fields of the data source without aliases.
  19. The apparatus of claim 17, wherein the instructions executable by the processor to cause the apparatus to generate the first query comprise instructions executable by the processor to cause the apparatus to generate a Structured Query Language (SQL) query comprising a SELECT statement.
  20. The apparatus of claim 15, further comprising instructions executable by the processor to cause the apparatus to store the aggregate component in a library of data pipeline components for building data pipelines.
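Claims 5, 6, 13, and 18 describe generating queries that either select all data fields or bind a chosen subset of columns to aliases for the virtual source. A minimal sketch of that query generation (hypothetical function and table names; the claims do not specify an implementation) might look like:

```python
def build_source_query(table, columns=None, aliases=None):
    """Generate the SELECT feeding an aggregate component's virtual source.

    - columns is None: all data fields of the source are requested
      (the all-columns case of claims 5 and 13), so no aliasing is needed.
    - columns given: each column is bound to the corresponding
      expected-field alias of the virtual source (the subset case of
      claims 6, 13, and 18).
    """
    if columns is None:
        return f"SELECT * FROM {table}"
    select_list = ", ".join(
        f"{col} AS {alias}" for col, alias in zip(columns, aliases)
    )
    return f"SELECT {select_list} FROM {table}"

# All-columns case vs. aliased-subset case:
q_all = build_source_query("orders")
q_subset = build_source_query("orders", ["amt", "ts"], ["value", "timestamp"])
```

The aliases decouple the aggregate component's internal logic from the concrete column names of whatever source it is attached to.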

Description

BACKGROUND

The disclosure generally relates to digital data processing and information retrieval (e.g., CPC subclass G06F/00) and ETL procedures (e.g., CPC subclass G06F/254). ETL (extract, transform, load) is a data integration process introduced in the 1970s. The ETL process extracts data from multiple data sources, cleans and organizes (i.e., transforms) the extracted data for the intended use and/or target system, and loads the transformed data into a target system (e.g., a data warehouse or data lake). ELT (extract, load, transform) is a similar data integration process that defers transformation until after the extracted raw data has been loaded into the target system. The rise of cloud computing has introduced “ETL pipelines” or “data pipelines.” An ETL pipeline refers to the implementation or collection of processes and tools for ETL in a cloud computing environment that involves not only multiple data sources but heterogeneous data sources. In some cases, “cloud ETL” or “cloud ELT” is used instead of data pipeline. While “data pipeline” and “ETL pipeline” are sometimes used interchangeably, some use “data pipeline” to refer more specifically to a data integration process that includes streaming or “real-time” data sources. However, it is more common for “data pipeline” to refer to the processes and tools that collectively implement ETL regardless of whether the data sources are streaming or “real-time” sources. “Data pipeline” suggests the flow of data over a pipeline from sources, through a series of processing steps or the components that implement those steps, to a destination or sink.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the disclosure may be better understood by referencing the accompanying drawings.

FIG. 1 is a conceptual diagram of integrating an aggregate component into a data pipeline being built.
FIG. 2 is a conceptual diagram depicting the creation of a reusable aggregate component.
FIG. 3 is a flowchart of example operations for incorporating an aggregate component into a data pipeline.
FIG. 4 is an example of operations to build an aggregate component using generic data pipeline components from a library of generic components.
FIG. 5 is an example of operations for building an aggregate component using data pipeline components from an existing data pipeline.
FIG. 6 depicts an example computer system with an aggregate component manager.

DESCRIPTION

The description that follows includes example systems, methods, techniques, and program flows to aid in understanding the disclosure, not to limit claim scope. Well-known instruction instances, protocols, structures, and techniques have not been shown in detail for conciseness.

Terminology

This description uses the term “aggregate component” to refer to a collection of components of a data pipeline. The aggregate component comprises one or more data pipeline components that perform respective operations for ELT/ETL. Data pipeline components included in an aggregate component can perform, but are not limited to, aggregation transformations, as aggregate components can comprise any type of data pipeline component offered by a data pipeline orchestration platform.

Use of the phrase “at least one of” preceding a list with the conjunction “and” should not be treated as an exclusive list and should not be construed as a list of categories with one item from each category, unless specifically stated otherwise. A clause that recites “at least one of A, B, and C” can be infringed with only one of the listed items, multiple of the listed items, or one or more of the items in the list and another item not listed.

Overview

Data pipeline orchestration tools have been integrated into software developer workflows to increase productivity and efficiency. These tools allow developers to abstract the logic of languages used as part of data pipeline creation, such as Structured Query Language (SQL), into a graphical user interface (GUI) depiction and simplify ELT/ETL creation for end users. However, these abstractions of lower-level languages, such as SQL, have constrained developers to work within the specific schemas of their databases, making it difficult to reuse generic data transformation logic across multiple data pipelines.

Disclosed herein is a data pipeline orchestration tool that facilitates the creation of reusable aggregate components that can be created during building of new data pipelines or integrated into existing data pipelines. With this tool, aggregate components can be placed within a data pipeline and connected to the logic therein. To accomplish this connection, aggregate components utilize virtual components as connectors to other components of data pipelines, which include a virtual source and/or a virtual sink. The virtual source maps a specific schema of a source of data to be processed by the aggregate component, such as specific fields of a database table, to a generic schema of the virtual source.
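To make the connector role of virtual components concrete, here is a toy sketch (all class and variable names are hypothetical, not the tool's actual design) in which the virtual source renames incoming columns to the aggregate component's generic schema and the virtual sink passes results downstream unchanged:

```python
class AggregateComponent:
    """Toy model of an aggregate component bounded by virtual connectors."""

    def __init__(self, inner_components, expected_fields):
        self.inner_components = inner_components  # wrapped pipeline steps
        self.expected_fields = expected_fields    # generic input schema

    def run(self, rows, column_map):
        # Virtual source: rename concrete source columns to generic fields.
        data = [
            {generic: row[concrete] for generic, concrete in column_map.items()}
            for row in rows
        ]
        # Wrapped components process the data under the generic schema.
        for component in self.inner_components:
            data = component(data)
        # Virtual sink: output flows unchanged to the downstream component.
        return data

# A wrapped step that doubles every "value" field:
double = lambda rows: [{**r, "value": r["value"] * 2} for r in rows]
agg = AggregateComponent([double], expected_fields=("value",))
out = agg.run([{"amt": 3}], column_map={"value": "amt"})
```

Because the inner components only ever see the generic schema, the same aggregate component can be reused against any source whose columns can be mapped onto that schema.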