US-12619623-B2 - On-demand retrieval of structured data in aggregating data across distinct sources
Abstract
Described is a method for enabling a user to generate a complex aggregation on their own. The method provides the user with a graphical user interface that displays data items in a data catalog and that provides controls for the user to select data items to be used in generating the complex aggregation and to select a type of aggregation. Based on the user's selections, computer instructions to generate a value of the complex aggregation are generated automatically.
Inventors
- Joel Gould
Assignees
- AB INITIO TECHNOLOGY LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-06-20
Claims (20)
- 1. A method implemented by a data processing system for providing a user with a graphical user interface that displays data items in a data catalog and that provides controls for the user to select data items to be used in generating one or more aggregations, wherein the controls also enable the user to select a type of aggregation, and based on the user's selections, automatically generating computer instructions to generate a value of the one or more aggregations that the user had selected, the method including:
  accessing, from a data catalog, names of fields of datasets of a plurality of data sources, with a dataset including multiple data elements, each data element structured with fields;
  displaying a graphical user interface that provides first visual representations of at least a plurality of the names of the fields of the datasets of the plurality of data sources accessed from the data catalog, with the at least the plurality of the names representing candidate inputs for defining an aggregation;
  displaying, in the graphical user interface, one or more first controls for specifying which of the candidate inputs represented by the at least the plurality of the names accessed from the data catalog are selected for defining the aggregation;
  wherein a first control specifies which of the candidate inputs are selected by enabling selection of a first visual representation, wherein a selected, first visual representation specifies a name of a field of a dataset of the plurality of data sources selected for defining the aggregation;
  identifying first visual representations selected by at least one of the one or more first controls;
  identifying, by a data processing system executing logic, names of fields represented by the selected first visual representations;
  based on the identified names of fields, determining one or more types of aggregations that are permissible for aggregating data items associated with the identified names of the fields, wherein a type of aggregation is associated with one or more names of one or more fields;
  displaying, in the graphical user interface, one or more second visual representations of the one or more types of aggregations that are permissible for aggregating the data items;
  displaying, in the graphical user interface, one or more second controls for specifying which type of aggregation is selected;
  wherein a second control specifies which type of aggregation is selected by enabling selection of a second visual representation;
  whereby the graphical user interface provides for selection of one or more resources;
  in response to selection of at least one of the resources, displaying, in the graphical user interface, one or more third visual representations of fields associated with the at least one of the resources selected;
  based on first visual representations selected by at least one of the one or more first controls, one or more second visual representations selected by at least one of the one or more second controls and at least a selected one of the one or more third visual representations, generating, by the data processing system, computer instructions that are executable to:
    detect data items identified by one or more of the names represented by the selected first visual representations; and
    based on the detected data items and the at least selected one of the one or more third visual representations, generate one or more values of one or more aggregations of one or more types represented by one or more selected second visual representations; and
  storing, in memory, the computer instructions.
- 2. The method of claim 1, further including: based on the first visual representations selected by the at least one of the one or more first controls and the one or more second visual representations selected by the at least one of the one or more second controls, generating a definition of the aggregation; wherein the definition specifies names represented by the selected first visual representations, and wherein the definition specifies a type represented by the one or more selected second visual representations.
- 3. The method of claim 1, wherein the detecting of data items identified by one or more names represented by the selected first visual representations includes detecting a first data item from a first data source and a second data item from a second data source, wherein the first and second data items are identified by one or more names represented by the selected first visual representations, and wherein the first and second data sources are distinct and/or different data sources; and wherein the generating of the one or more values of the one or more aggregations of the one or more types represented by the one or more selected second visual representations includes generating the values of the aggregations of the types represented by the selected second visual representations based on the detected first and second data items.
- 4. The method of claim 1, the generating of the computer instructions further including: generating a first transform based on the selected first visual representations, wherein the first transform is configured to be inserted into one or more placeholders in one or more pre-configured templates of one or more computation graphs.
- 5. The method of claim 4, the generating of the computer instructions further including: generating a second transform based on the selected second visual representation, wherein the second transform is configured to be inserted into a placeholder in the one or more pre-configured templates of the one or more computation graphs.
- 6. The method of claim 5, the generating of the computer instructions further including: inserting the first transform and the second transform into the respective placeholders in the one or more pre-configured templates of the one or more computation graphs for generating the aggregation.
- 7. The method of claim 4, wherein the one or more templates of the one or more computation graphs include a template batch graph, wherein the template batch graph includes a placeholder for insertion of the first transform such that the template batch graph with the first transform inserted into the placeholder of the template batch graph is configured to perform, in predetermined time intervals, batch retrieval from disk of data items used for the one or more aggregates.
- 8. The method of claim 7, wherein the batch retrieval from disk of the data items is performed by querying the data items in a single query from the disk.
- 9. The method of claim 7, wherein the one or more templates of the one or more computation graphs include a template real-time graph, and wherein the template real-time graph includes a placeholder for insertion of the first transform, such that the template real-time graph with the first transform inserted into the placeholder of the template real-time graph is configured to perform real-time retrieval from memory of data items used for the one or more aggregates.
- 10. The method of claim 9, wherein the memory is volatile memory.
- 11. The method of claim 9, wherein the real-time retrieval from memory of the data items is performed by querying the data items in a single query from the memory.
- 12. The method of claim 9, wherein the one or more templates of the one or more computation graphs further include a template aggregate graph, which includes a placeholder for insertion of the second transform.
- 13. The method of claim 12, wherein the template aggregate graph is connected with the output of the template batch graph and the output of the template real-time graph such that, with the second transform inserted into the placeholder of the template aggregate graph, results of the batch retrieval are supplemented with results from the real-time retrieval to generate one or more values of one or more aggregations of one or more types represented by one or more selected second visual representations.
- 14. The method of claim 1, further including: in response to receiving a request for the aggregate, executing the stored computer instructions to detect data items identified by one or more names represented by the selected first visual representations and to generate the one or more values of the one or more aggregations of the one or more types represented by the one or more selected second visual representations based on the detected data items.
- 15. The method of claim 14, wherein the one or more aggregates include multiple aggregates, and the computer instructions are configured such that the data items used for the multiple aggregates are to be queried in a single query from one or more data storages.
- 16. The method of claim 1, wherein the displaying, in the graphical user interface, of the one or more second visual representations of the one or more types of aggregations that are permissible for aggregating the data items is based on the selected first visual representation, preferably such that only second visual representations of types of aggregations that are permissible for aggregating the data items identified by the names that are specified by the selected first visual representation are displayed in the graphical user interface.
- 17. The method of claim 1, further including: displaying, in the graphical user interface, one or more third controls for specifying a duration over which the aggregation is generated.
- 18. The method of claim 17, further including: displaying, in the graphical user interface, one or more fourth controls for specifying an event type over which the aggregation is generated.
- 19. The method of claim 1, wherein the data catalog is distinct from the plurality of data sources.
- 20. A data processing system for providing a user with a graphical user interface that displays data items in a data catalog and that provides controls for the user to select data items to be used in generating one or more aggregations, wherein the controls also enable the user to select a type of aggregation, and based on the user's selections, automatically generating computer instructions to generate a value of the one or more aggregations that the user had selected, the data processing system including:
  one or more processors; and
  one or more machine-readable hardware storage devices storing instructions that are executable by the one or more processors to perform actions of:
    accessing, from a data catalog, names of fields of datasets of a plurality of data sources, with a dataset including multiple data elements, each data element structured with fields;
    displaying a graphical user interface that provides first visual representations of at least a plurality of the names of the fields of the datasets of the plurality of data sources accessed from the data catalog, with the at least the plurality of the names representing candidate inputs for defining an aggregation;
    displaying, in the graphical user interface, one or more first controls for specifying which of the candidate inputs represented by the at least the plurality of the names accessed from the data catalog are selected for defining the aggregation;
    wherein a first control specifies which of the candidate inputs are selected by enabling selection of a first visual representation, wherein a selected, first visual representation specifies a name of a field of a dataset of the plurality of data sources selected for defining the aggregation;
    identifying first visual representations selected by at least one of the one or more first controls;
    identifying, by a data processing system executing logic, names of fields represented by the selected first visual representations;
    based on the identified names of fields, determining one or more types of aggregations that are permissible for aggregating data items associated with the identified names of the fields, wherein a type of aggregation is associated with one or more names of one or more fields;
    displaying, in the graphical user interface, one or more second visual representations of the one or more types of aggregations that are permissible for aggregating the data items;
    displaying, in the graphical user interface, one or more second controls for specifying which type of aggregation is selected;
    wherein a second control specifies which type of aggregation is selected by enabling selection of a second visual representation;
    whereby the graphical user interface provides for selection of one or more resources;
    in response to selection of at least one of the resources, displaying, in the graphical user interface, one or more third visual representations of fields associated with the at least one of the resources selected;
    based on first visual representations selected by at least one of the one or more first controls, one or more second visual representations selected by at least one of the one or more second controls and at least a selected one of the one or more third visual representations, generating, by the data processing system, computer instructions that are executable to:
      detect data items identified by one or more of the names represented by the selected first visual representations; and
      based on the detected data items and the at least selected one of the one or more third visual representations, generate one or more values of one or more aggregations of one or more types represented by one or more selected second visual representations; and
    storing, in memory, the computer instructions.
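Illustrative sketch (not part of the claims): the flow recited above, from selected field names to permissible aggregation types to stored, executable instructions, might be approximated in Python as follows. Everything named here is an assumption made for illustration: the catalog contents, the `PERMISSIBLE` rule table keyed by field type, the aggregation names, and the `template_graph` function standing in for a pre-configured template with placeholders for a first and a second transform. The patent does not disclose this code, and graphical user interface handling and resource selection are omitted.

```python
# Hypothetical data catalog: field name -> (data source, field type).
CATALOG = {
    "purchase_amount": ("transactions_db", "number"),
    "refund_amount": ("refunds_feed", "number"),
    "customer_region": ("crm", "string"),
}

# Assumed rule table: aggregation types permitted for each field type.
PERMISSIBLE = {
    "number": {"sum", "average", "count", "maximum"},
    "string": {"count"},
}

# Assumed implementations of the aggregation types named above.
AGGREGATORS = {
    "sum": sum,
    "average": lambda values: sum(values) / len(values),
    "count": len,
    "maximum": max,
}


def permissible_types(field_names):
    """Return the aggregation types allowed for all selected fields."""
    allowed = [PERMISSIBLE[CATALOG[name][1]] for name in field_names]
    return set.intersection(*allowed) if allowed else set()


def template_graph(first_transform, second_transform):
    """Stand-in for a pre-configured template with two placeholders."""
    def run(batch_records, realtime_records=()):
        # Batch results are supplemented with real-time results, then aggregated.
        items = first_transform(batch_records) + first_transform(realtime_records)
        return second_transform(items)
    return run


def generate_instructions(field_names, aggregation_type):
    """Build executable instructions from the user's two selections."""
    if aggregation_type not in permissible_types(field_names):
        raise ValueError(f"{aggregation_type!r} is not permissible for {field_names}")

    def first_transform(records):
        # Detect data items identified by the selected field names.
        return [r[name] for r in records for name in field_names if name in r]

    return template_graph(first_transform, AGGREGATORS[aggregation_type])


# Storing the generated instructions in memory, here in a plain dict.
STORED = {"total_spend": generate_instructions(["purchase_amount"], "sum")}

batch = [{"purchase_amount": 10.0}, {"customer_region": "EU"}]
realtime = [{"purchase_amount": 5.0}]
print(STORED["total_spend"](batch, realtime))  # prints 15.0
```

In this sketch, selecting both `purchase_amount` and `customer_region` leaves only `count` as a permissible type, which mirrors the idea of offering only aggregation types that apply to every selected field. Taking the intersection across fields is one possible rule, chosen here for simplicity.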
Description
CLAIM OF PRIORITY
This application claims priority under 35 U.S.C. § 119(e) to U.S. Patent Application Ser. No. 63/481,488, filed on Jan. 25, 2023, the entire contents of which are hereby incorporated by reference.
BACKGROUND
This disclosure relates to techniques for efficiently operating a data processing system with a large number of datasets that may be stored in any of a large number of data stores.
Modern data processing systems manage vast amounts of data within an enterprise. A large institution, for example, may have millions of datasets. These datasets can support multiple aspects of the operation of the enterprise.
Complex data processing systems typically process data in multiple stages, with the results produced by one stage being fed into the next stage. The overall flow of information through such systems may be described in terms of a directed dataflow graph, with nodes or vertices in the graph representing components (either data files or processes), and the links or “edges” in the graph indicating flows of data between the components. A system for executing such graph-based computations is described in U.S. Pat. No. 5,966,072, titled “Executing Computations Expressed as Graphs,” incorporated herein by reference.
Graphs also can be used to invoke computations directly. Graphs made in accordance with this system provide methods for getting information into and out of individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes. Systems that invoke these graphs include algorithms that choose inter-process communication methods and algorithms that schedule process execution, and also provide for monitoring of the execution of the graph.
To support a wide range of functions, a data processing system may execute applications, whether to implement routine processes or to extract insights from the datasets.
The applications may be programmed to access the data stores to read and write data.
SUMMARY
In a general aspect 1, described is a method implemented by a data processing system for providing a user with a graphical user interface that displays data items in a data catalog and that provides controls for the user to select data items to be used in generating one or more aggregations, wherein the controls also enable the user to select a type of aggregation, and based on the user's selections, automatically generating computer instructions to generate a value of the one or more aggregations that the user had selected, the method including: accessing identifiers of a plurality of data items from a data catalog; displaying a graphical user interface that provides first visual representations of the identifiers accessed from the data catalog, with the identifiers representing candidate inputs for defining an aggregation; displaying, in the graphical user interface, one or more first controls for specifying which of the candidate inputs are selected for defining the aggregation; wherein a first control specifies which of the candidate inputs are selected by enabling selection of a first visual representation, wherein a selected, first visual representation specifies an identifier selected for defining the aggregation; displaying, in the graphical user interface, one or more second visual representations of one or more types of aggregations that are permissible for aggregating the data items; displaying, in the graphical user interface, one or more second controls for specifying which type of aggregation is selected; wherein a second control specifies which type of aggregation is selected by enabling selection of a second visual representation; based on first visual representations selected by at least one of the one or more first controls and one or more second visual representations selected by at least one of the one or more second controls, generating, by the data processing system, computer instructions that are executable to: detect data items identified by one or more identifiers represented by the selected first visual representations; and based on the detected data items, generate one or more values of one or more aggregations of one or more types represented by one or more selected second visual representations; and storing, in memory, the computer instructions.
In an aspect 2 according to aspect 1, the method further includes: based on the first visual representations selected by the at least one of the one or more first controls and the one or more second visual representations selected by the at least one of the one or more second controls, generating a definition of the aggregation; wherein the definition specifies identifiers represented by the selected first visual representations, and wherein the definition specifies a type represented by the one or more selected second visual representations.
In an aspect 3 according to any one of aspects 1 to 2, the detecting of data items identified by one or more