WO-2026096598-A1 - DATA PROCESSING SYSTEM FOR AUTOMATIC PROCESSING OF CONTINUOUS FLOWS OR BATCH DATA

Abstract

Techniques for executing a data processing application in an environment in which there can be a plurality of data sources including continuous data sources and batch data sources. A data processing application may be representable as a plurality of input nodes and a plurality of processing nodes. The techniques include: for a node of the plurality of processing nodes having a first input configured at the time of execution of the application to receive batch data and a second input configured to receive continuous data: computing first data by executing data processing operations of the data processing application between the first input of the node and one or more data sources of the plurality of data sources on data from the one or more data sources; and storing the first data; and configuring the data processing system to, when executing the data processing application, use the stored first data as the first input to the node.
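The precompute-and-reuse idea in the abstract can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the `customer_id` key, the `active` and `name` fields, the JSON file format, and the inner-join behavior are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator

Record = Dict[str, object]


def run_batch_branch(customers: Iterable[Record]) -> Dict[str, Record]:
    """Execute the batch-only operations upstream of the node's first input.

    Hypothetical upstream operation: keep active customers, keyed by id.
    """
    return {str(c["customer_id"]): c for c in customers if c.get("active")}


def store_first_data(first_data: Dict[str, Record], path: Path) -> None:
    """Store the computed first data (here as a JSON file, an assumption)."""
    path.write_text(json.dumps(first_data))


def join_continuous(stream: Iterable[Record], path: Path) -> Iterator[Record]:
    """Use the stored first data as the node's first input.

    Each continuous record supplies a key that selects the matching stored
    record. Unmatched records are dropped (inner join, an assumption).
    """
    lookup = json.loads(path.read_text())
    for event in stream:
        match = lookup.get(str(event["customer_id"]))
        if match is not None:
            yield {**event, "customer_name": match["name"]}
```

Because `join_continuous` is a generator, it consumes the second input one record at a time and never needs that input to end, while the batch branch runs once before streaming begins.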

Inventors

  • Wholey, Joseph, Skeffington

Assignees

  • AB INITIO TECHNOLOGY LLC
  • AB INITIO ORIGINAL WORKS LLC
  • AB INITIO SOFTWARE LLC

Dates

Publication Date: 2026-05-07
Application Date: 2025-10-29
Priority Date: 2024-10-30

Claims (20)

  1. A method, performed by a data processing system, for executing a data processing application in an environment in which there can be a plurality of data sources including continuous data sources and batch data sources, the data processing application representable as a plurality of input nodes representing a plurality of input data sources and a plurality of processing nodes representing data processing operations to be performed on data and relative ordering of the data processing operations performed on data from data sources of the plurality of data sources, the method comprising: for a node of the plurality of processing nodes having a first input configured to receive batch data and a second input configured to receive continuous data: computing first data by executing data processing operations of the data processing application between the first input of the node and one or more data sources of the plurality of data sources on data from the one or more data sources; and storing the first data; and configuring the data processing system to, when executing the data processing application, use the stored first data as the first input to the node.
  2. The method of claim 1, wherein using the stored first data as the first input to the node comprises, for each of a plurality of records of the continuous data at the second input to the node, using a field value in the continuous data as a key to select a corresponding record in the stored first data.
  3. The method of claim 1 or claim 2, wherein the first node is configured to receive batch data by direct or indirect upstream connections within the data processing application only to batch data sources.
  4. The method of any of claims 1-3, wherein storing the first data comprises storing the first data as a file.
  5. The method of any of claims 1-4, wherein the one or more data sources are all batch data sources.
  6. The method of any of claims 1-5, wherein computing first data comprises identifying the node by searching the data processing application for nodes having a first input coupled within the data processing application directly or indirectly to only upstream data sources that are batch data sources and a second input coupled directly or indirectly to an upstream data source that is a continuous data source.
  7. The method of any of claims 1-6, wherein the node represents a join operation.
  8. The method of any of claims 1-7, wherein the data processing application is formatted as a data flow graph.
  9. The method of any of claims 1-8, wherein the acts of computing and storing are performed for each of a plurality of nodes of the data processing application having a first input configured to receive batch data and a second input configured to receive continuous data.
  10. The method of any of claims 1-9, wherein: the method is performed at a first time for execution of the data processing application when an upstream batch data source is connected, through direct or indirect upstream connections within the data processing application, to the second input; and the method is performed at a second time for execution of the data processing application when an upstream continuous data source is connected instead of the upstream batch data source, through the direct or indirect upstream connections within the data processing application, to the second input.
  11. The method of claim 10, wherein the first time comprises execution of the data processing application in a development or test environment.
  12. The method of claim 11, wherein the second time comprises execution of the data processing application in a production environment.
  13. The method of claim 10, wherein the first time comprises execution of the data processing application on data for a finite time period and the second time comprises execution of the data processing application on data being generated in real time.
  14. A data processing system configured to execute a data processing application in an environment in which there can be a plurality of data sources including continuous data sources and batch data sources, the data processing application representable as a plurality of input nodes representing a plurality of input data sources and a plurality of processing nodes representing data processing operations to be performed on data and relative ordering of the data processing operations performed on data from data sources of the plurality of data sources, the data processing system configured to perform a method of any of claims 1-13.
  15. At least one non-transitory computer-readable storage medium storing instructions, that when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method of any of claims 1-13.
  16. A method, performed by a data processing system, for executing a data processing application, the data processing application comprising one or more input nodes representing one or more input data sources, one or more output nodes representing one or more output data stores, and a plurality of nodes representing a sequence of data processing operations to be performed on data, the method comprising: determining whether at least a first input data source of the one or more input data sources of the data processing application is a continuous input data source; based on determining that the first input data source of the one or more input data sources is a continuous input data source: identifying a downstream portion of the data processing application, downstream from the continuous input data source, that is configured to operate on continuous data originating from the continuous input data source; and for a first node in the downstream portion of the data processing application that represents an operation to be performed on the continuous data originating from the continuous input data source and batch data originating from a batch input data source, generating a first lookup data structure by processing a first portion of the data processing application that is configured to operate on the batch data originating from the batch input data source; and transforming the data processing application, wherein the transforming comprises configuring the first node representing the operation to use the first lookup data structure as input during execution of the data processing application.
  17. The method of claim 16, wherein identifying a downstream portion of the data processing application comprises: storing, in a data store, one or more labels identifying one or more nodes of the data processing application downstream from the continuous input data source as continuous components.
  18. The method of claim 16 or claim 17, wherein generating the first lookup file by processing the first portion of the data processing application comprises: identifying one or more nodes of the data processing application upstream from the first node that are configured to operate on the batch data originating from the batch input data source; and generating the first lookup data structure by processing the one or more nodes upstream from the first node.
  19. The method of any of claims 16-18, further comprising: storing, in computer readable media, the first lookup data structure for use during execution of the data processing application.
  20. The method of any of claims 16-19, further comprising: for a second node in the downstream portion of the data processing application that represents an operation to be performed on the continuous data originating from the continuous input data source and batch data originating from a second batch input data source, generating a second lookup data structure by processing a second portion of the data processing application that is configured to operate on the batch data originating from the second batch input data source.
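The graph search recited in claims 6, 16, and 17 can be pictured with a short Python sketch. It shows one possible reading of the claims, not the applicant's implementation. The dictionary encoding of the graph, the `source_kinds` mapping, and all function and node names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: processing node -> ordered list of its upstream inputs.
Graph = Dict[str, List[str]]
# Hypothetical metadata: data source name -> "batch" or "continuous".
SourceKinds = Dict[str, str]


def upstream_sources(graph: Graph, node: str, source_kinds: SourceKinds) -> Set[str]:
    """Collect every data source reachable by walking upstream from node."""
    seen: Set[str] = set()
    found: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in source_kinds:
            found.add(current)
        stack.extend(graph.get(current, []))
    return found


def label_continuous(graph: Graph, source_kinds: SourceKinds) -> Set[str]:
    """Label processing nodes that sit downstream of a continuous source."""
    return {
        node
        for node in graph
        if any(
            source_kinds[s] == "continuous"
            for s in upstream_sources(graph, node, source_kinds)
        )
    }


def find_mixed_inputs(graph: Graph, source_kinds: SourceKinds) -> List[Tuple[str, str]]:
    """Return (node, input) pairs where the input reaches only batch sources
    and another input of the same node reaches a continuous source."""
    result: List[Tuple[str, str]] = []
    for node, inputs in graph.items():
        batch_only: List[str] = []
        has_continuous = False
        for inp in inputs:
            kinds = {source_kinds[s] for s in upstream_sources(graph, inp, source_kinds)}
            if kinds == {"batch"}:
                batch_only.append(inp)
            elif "continuous" in kinds:
                has_continuous = True
        if has_continuous:
            result.extend((node, inp) for inp in batch_only)
    return result
```

Each pair returned by `find_mixed_inputs` names a branch that could be executed once ahead of time to build a lookup data structure. A node whose inputs all trace back to batch sources is left untouched, since nothing continuous reaches it.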

Description

DATA PROCESSING SYSTEM FOR AUTOMATIC PROCESSING OF CONTINUOUS FLOWS OR BATCH DATA

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/713,982, filed on October 30, 2024, entitled “DATA PROCESSING SYSTEM FOR AUTOMATIC PROCESSING OF CONTINUOUS FLOWS OR BATCH DATA,” which is incorporated by reference herein in its entirety.

FIELD

[0002] The disclosure herein relates to data processing systems and methods performed by the data processing systems that automatically adapt execution of a data processing application based on whether the types of input data sources are the same or different.

BACKGROUND

[0003] Modern data processing systems manage vast amounts of data (e.g., millions, billions, or trillions of data records) and manage how these data may be accessed (e.g., created, updated, read, or deleted). A large institution (e.g., a multinational bank, global technology company, etc.) may have millions of datasets. For example, the datasets may store transaction records, documents, tables, files, or any other suitable type of data. The datasets can be batch data that is stored in memory and has a finite beginning and finite end (e.g., data stored in files) or continuous data that is a stream of data values (e.g., data in queues and Kafka event streams), which may have no predefined ending and may be generated in response to events.

[0004] A data processing system may store “metadata,” which is data that contains information about other data (e.g., stored in the same data processing system and/or another data processing system) and/or processes (e.g., in the same data processing system and/or another data processing system). For example, a data processing system may store metadata about data stored in a table or obtained from a continuous source. Non-limiting examples of such metadata include information indicating that the data source is a continuous data source or a batch data source. Metadata may also include, for batch data stored in a table for example, the size of the table in memory, when the table was created, when the table was last updated, the number of rows and/or columns in the table, where the table is stored, and who has permission to read, update, delete and/or perform any other suitable action(s) with respect to the table.

[0005] A data processing system may execute data processing applications to support various functions. Data processing applications may be used to provide functions that support processes of an institution. The data processing applications may perform operations on datasets as part of executing such functions. For example, data processing applications may perform operations on batch data, such as a database of sensors or customers of an enterprise, or continuous data, such as measurements output by the sensors or transactions performed by the customers.

[0006] Typically, different data processing applications are written to process batch and continuous data. A developer writes code for the different data processing applications based on the type of data they are designed to process. For example, a developer writes code for a first data processing application designed to process batch data and code for a second data processing application designed to process continuous data. The code for the different data processing applications has to be maintained across different environments that the data processing system operates in (e.g., development, test, and production environments). Maintaining, compiling and executing multiple different data processing applications across different environments requires significant computing resources, such as storage or processing resources.

SUMMARY

[0007] Some embodiments provide a method, performed by a data processing system, for executing a data processing application in an environment in which there can be a plurality of data sources including continuous data sources and batch data sources, the data processing application representable as a plurality of input nodes representing a plurality of input data sources and a plurality of processing nodes representing data processing operations to be performed on data and relative ordering of the data processing operations performed on data from data sources of the plurality of data sources, the method comprising: for a node of the plurality of processing nodes having a first input configured to receive batch data and a second input configured to receive continuous data: computing first data by executing data processing operations of the data processing application between the first input of the node and one or more data sources of the plurality of data sources on data from the one or more data sources; and storing the first data; and configuring the data processing system to, when executing the data processing application, use the stored first data as the first input to the node.

[0008] Some embodim