US-20260127154-A1 - LOADING DATA VIA A DATABASE SYSTEM BASED ON IMPLEMENTING A CONTINUOUS PIPELINE
Abstract
A database system is operable to load data for storage via the database system in conjunction with utilizing a continuous pipeline over a temporal period. An event monitor module is implemented based on executing a plurality of polls to a set of event topics of a set of other monitors over the temporal period to poll a plurality of sets of messages from the set of event topics and/or adding a plurality of file data to a table of files over the temporal period based on processing the plurality of sets of messages. A continuous pipeline task execution module is implemented to execute a continuous pipeline task based on dispersing file data of the table of files into a plurality of file work units over the temporal period and/or generating a plurality of extractor tasks to load the data for storage based on collectively processing the plurality of file work units.
Inventors
- Haoxuan LI
- Owen Pang
- Thomas E. Smith
Assignees
- Ocient Holdings LLC
Dates
- Publication Date
- 20260507
- Application Date
- 20241104
Claims (20)
- 1 . A method for execution by a database system, comprising: creating a continuous pipeline; loading data for storage via the database system in conjunction with utilizing the continuous pipeline over a temporal period based on: implementing an event monitor module based on: executing a plurality of polls to a set of event topics of a set of other monitors over the temporal period to poll a plurality of sets of messages from the set of event topics, wherein each poll of the plurality of polls is executed to poll a corresponding set of messages of the plurality of sets of messages from a corresponding one of the set of event topics; and adding a plurality of file data to a table of files over the temporal period based on processing the plurality of sets of messages, wherein each set of messages of the plurality of sets of messages is processed to add corresponding file data of the plurality of file data to the table of files; and implementing a continuous pipeline task execution module to execute a continuous pipeline task based on: dispersing file data of the table of files into a plurality of file work units over the temporal period; and generating a plurality of extractor tasks to load the data for storage based on collectively processing the plurality of file work units.
- 2 . The method of claim 1 , wherein the set of other monitors includes multiple monitors of multiple monitor types, and wherein polling the messages from the set of event topics includes interfacing with each of the multiple monitors in accordance with a corresponding protocol for a corresponding one of the multiple monitor types.
- 3 . The method of claim 2 , wherein interfacing with a first monitor of the set of other monitors includes: executing a first subset of the plurality of polls to a corresponding first subset of the set of event topics corresponding to the first monitor, wherein each poll of the first subset of the plurality of polls is executed to poll a corresponding set of messages of a first subset of the plurality of sets of messages from a corresponding one of the corresponding first subset of the set of event topics; and after adding each corresponding file data to the table of files in response to processing each corresponding set of messages of the first subset of the plurality of sets of messages, sending a request to the first monitor to delete the each corresponding set of messages of the first subset of the plurality of sets of messages.
- 4 . The method of claim 3 , wherein each corresponding set of messages of the first subset of the plurality of sets of messages polled via each poll includes up to a predetermined maximum number of messages configured for interfacing with the first monitor.
- 5 . The method of claim 3 , wherein a predetermined visibility timeout configured for interfacing with the first monitor is applied for deleting each corresponding set of messages of the first subset of the plurality of sets of messages polled via each poll of the first subset of the plurality of polls, and wherein, when the each corresponding set of messages is not deleted within the predetermined visibility timeout, the corresponding set of messages again becomes available for polling from the corresponding one of the corresponding first subset of the set of event topics.
- 6 . The method of claim 3 , wherein the multiple monitor types include: a Simple Queue Service (SQS) monitor type, wherein a first one of the multiple monitors is an SQS monitor having the SQS monitor type; and a Kafka monitor type, wherein a second one of the multiple monitors is a Kafka monitor having the Kafka monitor type.
- 7 . The method of claim 1 , wherein loading data for storage via the database system in conjunction with utilizing the continuous pipeline over the temporal period is further based on: deduplicating the plurality of file data based on identifying duplicate ones of the plurality of file data.
- 8 . The method of claim 1 , further comprising: suspending the loading of data for storage via the database system at a first time during the temporal period based on pausing utilization of the continuous pipeline at the first time; and resuming the loading of data for storage via the database system at a second time during the temporal period based on restarting utilization of the continuous pipeline at the second time.
- 9 . The method of claim 8 , wherein resuming the loading of data for storage is based on processing a start continuous pipeline function call received in a request from a user entity.
- 10 . The method of claim 8 , wherein loading the data for storage via the database system in conjunction with utilizing the continuous pipeline over the temporal period is further based on: maintaining state data for the event monitor module, wherein resuming the loading of data for storage via the database system at the second time is based on accessing the state data for the event monitor module.
- 11 . The method of claim 10 , wherein maintaining the state data includes updating, in response to processing each set of messages, at least one of: a file count value; a file total size value; a last loaded offset value; or a high watermark value.
- 12 . The method of claim 8 , wherein the loading of data for storage via the database system is suspended at the first time in response to encountering an error.
- 13 . The method of claim 1 , wherein the table of files is maintained as a relational database table stored in system metadata of the database system.
- 14 . The method of claim 13 , further comprising maintaining a plurality of additional relational database tables in the system metadata that includes: a loading tracking table indicating at least one loading metric tracked in conjunction with loading the data; and an error tracking table indicating at least one error encountered in conjunction with loading the data.
- 15 . The method of claim 14 , wherein the data is loaded across a plurality of batches, wherein each batch includes a corresponding subset of the plurality of file work units and is loaded by a corresponding one of the plurality of extractor tasks, wherein the loading tracking table is populated with a first plurality of entries based on logging a corresponding entry of the first plurality of entries in response to processing each batch of the plurality of batches, and wherein the error tracking table is populated with a second plurality of entries based on logging a corresponding entry of the second plurality of entries in response to encountering an error in loading a batch of the plurality of batches.
- 16 . The method of claim 1 , wherein implementing the event monitor module includes generating event notifications based on at least one of: generating a modification time-based file data listing based on filtering out file data of the plurality of file data with a last modification time outside a configured modification time range; or generating a file name-based file data listing based on sorting the file data of the plurality of file data by file name.
- 17 . The method of claim 1 , wherein the continuous pipeline is created in accordance with user-configured selections for a set of user-configurable parameters indicated in a continuous pipeline creation function call received in a request from a user entity.
- 18 . The method of claim 17 , wherein the user-configured selections include at least one of: a selected monitor type for a monitor type parameter of the set of user-configurable parameters; a selected polling interval for a polling interval parameter of the set of user-configurable parameters; a selected minimum update size for a minimum update size parameter of the set of user-configurable parameters; a selected update timeout for an update timeout parameter of the set of user-configurable parameters; a selected batch timeout for a batch timeout parameter of the set of user-configurable parameters; or a selected batch minimum file count for a batch minimum file count parameter of the set of user-configurable parameters.
- 19 . A database system comprising: at least one processor; and at least one memory storing operational instructions that, when executed by the at least one processor, cause the database system to: create a continuous pipeline; load data for storage via the database system in conjunction with utilizing the continuous pipeline over a temporal period based on: implementing an event monitor module based on: executing a plurality of polls to a set of event topics of a set of other monitors over the temporal period to poll a plurality of sets of messages from the set of event topics, wherein each poll of the plurality of polls is executed to poll a corresponding set of messages of the plurality of sets of messages from a corresponding one of the set of event topics; and adding a plurality of file data to a table of files over the temporal period based on processing the plurality of sets of messages, wherein each set of messages of the plurality of sets of messages is processed to add corresponding file data of the plurality of file data to the table of files; and implementing a continuous pipeline task execution module to execute a continuous pipeline task based on: partitioning file data of the table of files into a plurality of file work units over the temporal period; and generating a plurality of extractor tasks to load the data for storage based on collectively processing the plurality of file work units.
- 20 . A non-transitory computer readable storage medium comprising: at least one memory section that stores operational instructions that, when executed by at least one processing module that includes a processor and a memory, cause the at least one processing module to: create a continuous pipeline; load data for storage in conjunction with utilizing the continuous pipeline over a temporal period based on: implementing an event monitor module based on: executing a plurality of polls to a set of event topics of a set of other monitors over the temporal period to poll a plurality of sets of messages from the set of event topics, wherein each poll of the plurality of polls is executed to poll a corresponding set of messages of the plurality of sets of messages from a corresponding one of the set of event topics; and adding a plurality of file data to a table of files over the temporal period based on processing the plurality of sets of messages, wherein each set of messages of the plurality of sets of messages is processed to add corresponding file data of the plurality of file data to the table of files; and implementing a continuous pipeline task execution module to execute a continuous pipeline task based on: partitioning file data of the table of files into a plurality of file work units over the temporal period; and generating a plurality of extractor tasks to load the data for storage based on collectively processing the plurality of file work units.
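The queue-interfacing behavior recited in claims 3 through 6 (polling up to a predetermined maximum number of messages per poll, a visibility timeout, and deleting messages only after their file data has been added to the table of files) can be illustrated with a minimal sketch. This is not the patent's implementation: the in-memory `EventTopic` class below stands in for a real SQS or Kafka client, and every identifier is hypothetical.

```python
import time
from collections import deque


class EventTopic:
    """Hypothetical queue-style event topic with SQS-like visibility
    semantics: polled messages are hidden for a timeout, and reappear
    if they are not deleted within that window (claim 5)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = deque()  # entries of [message, invisible_until]

    def publish(self, message):
        self._messages.append([message, 0.0])

    def poll(self, max_messages=10):
        """Return up to max_messages currently-visible messages and hide
        them for the visibility timeout (claim 4's predetermined maximum)."""
        now = time.monotonic()
        polled = []
        for entry in self._messages:
            if len(polled) >= max_messages:
                break
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                polled.append(entry[0])
        return polled

    def delete(self, messages):
        """Remove processed messages so they never reappear (claim 3)."""
        self._messages = deque(e for e in self._messages
                               if e[0] not in messages)


def monitor_poll_cycle(topic, table_of_files, max_messages=10):
    """One event-monitor poll: add file data to the table of files,
    then request deletion of the processed messages."""
    messages = topic.poll(max_messages)
    for msg in messages:
        table_of_files.append(msg["file"])  # add file data (claim 1)
    if messages:
        topic.delete(messages)  # delete only after processing (claim 3)
    return len(messages)
```

In a real deployment the delete step would be a batched API call to the monitor (e.g., an SQS delete-message request); the sketch only preserves the ordering constraint that deletion follows successful processing.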
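Claim 7 adds deduplication of the polled file data. A monitor may deliver the same file notification more than once (for example, when a visibility timeout expires before deletion), so duplicates are identified and dropped before loading. A minimal sketch, assuming a hypothetical per-file identifier as the dedup key:

```python
def deduplicate_file_data(file_data_list, key=lambda f: f["name"]):
    """Keep only the first occurrence of each file identifier (claim 7).
    The "name" key is illustrative; the patent does not specify what
    attribute identifies a duplicate."""
    seen = set()
    unique = []
    for file_data in file_data_list:
        k = key(file_data)
        if k not in seen:
            seen.add(k)
            unique.append(file_data)
    return unique
```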
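Claims 10 through 12 describe maintaining state data for the event monitor module so that a suspended pipeline can resume from where it left off, and claim 11 names the tracked values. The sketch below models that state as a plain record; the field names follow claim 11, but the update and resume logic is an assumption, not the patent's method.

```python
from dataclasses import dataclass


@dataclass
class EventMonitorState:
    """State data per claim 11: file count, file total size,
    last loaded offset, and high watermark."""
    file_count: int = 0
    file_total_size: int = 0
    last_loaded_offset: int = -1
    high_watermark: int = -1

    def apply_message_set(self, files):
        """Update state after processing one set of messages.
        Each file is a hypothetical (size, offset) pair here."""
        for size, offset in files:
            self.file_count += 1
            self.file_total_size += size
            self.last_loaded_offset = offset
            self.high_watermark = max(self.high_watermark, offset)


def resume_from(state):
    """Resuming (claim 10) accesses the persisted state to decide where
    polling continues; this sketch simply returns the next offset."""
    return state.high_watermark + 1
```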
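Claim 1's continuous pipeline task disperses file data from the table of files into file work units, and claim 18 names user-configurable batching parameters such as a batch timeout and a batch minimum file count. The sketch below shows one plausible reading: a round-robin dispersal policy and a readiness check for dispatching a batch to an extractor task. Both the policy and the parameter semantics are assumptions; the patent does not specify them.

```python
import time


def disperse_into_work_units(table_of_files, num_units):
    """Disperse file data across a fixed number of file work units.
    Round-robin is an illustrative choice of dispersal policy."""
    units = [[] for _ in range(num_units)]
    for i, file_data in enumerate(table_of_files):
        units[i % num_units].append(file_data)
    return units


def ready_to_load(batch, batch_min_file_count, batch_started_at,
                  batch_timeout, now=None):
    """A batch is dispatched to an extractor task once it reaches the
    configured minimum file count, or once the batch timeout elapses
    (claim 18's parameters; the trigger logic here is hypothetical)."""
    now = time.monotonic() if now is None else now
    return (len(batch) >= batch_min_file_count
            or now - batch_started_at >= batch_timeout)
```

The timeout branch keeps small trickles of files from waiting indefinitely for a full batch, which is the usual reason loading systems pair a size threshold with a time bound.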
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
None.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not Applicable.
INCORPORATION-BY-REFERENCE OF MATERIAL SUBMITTED ON A COMPACT DISC
Not Applicable.
BACKGROUND OF THE INVENTION
Technical Field of the Invention
This invention relates generally to computer networking and more particularly to database systems and their operation.
Description of Related Art
Computing devices are known to communicate data, process data, and/or store data. Such computing devices range from wireless smart phones, laptops, tablets, personal computers (PCs), workstations, and video game devices, to data centers that support millions of web searches, stock trades, or on-line purchases every day. In general, a computing device includes a central processing unit (CPU), a memory system, user input/output interfaces, peripheral device interfaces, and an interconnecting bus structure.
As is further known, a computer may effectively extend its CPU by using “cloud computing” to perform one or more computing functions (e.g., a service, an application, an algorithm, an arithmetic logic function, etc.) on behalf of the computer. Further, for large services, applications, and/or functions, cloud computing may be performed by multiple cloud computing resources in a distributed manner to improve the response time for completion of the service, application, and/or function.
Of the many applications a computer can perform, a database system is one of the largest and most complex. In general, a database system stores a large amount of data in a particular way for subsequent processing. In some situations, the hardware of the computer is a limiting factor regarding the speed at which a database system can process a particular function. In some other instances, the way in which the data is stored is a limiting factor regarding the speed of execution.
In yet some other instances, restricted co-process options are a limiting factor regarding the speed of execution.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
FIG. 1 is a schematic block diagram of an embodiment of a large scale data processing network that includes a database system in accordance with various embodiments;
FIG. 1A is a schematic block diagram of an embodiment of a database system in accordance with various embodiments;
FIG. 2 is a schematic block diagram of an embodiment of an administrative sub-system in accordance with various embodiments;
FIG. 3 is a schematic block diagram of an embodiment of a configuration sub-system in accordance with various embodiments;
FIG. 4 is a schematic block diagram of an embodiment of a parallelized data input sub-system in accordance with various embodiments;
FIG. 5 is a schematic block diagram of an embodiment of a parallelized query and response (Q&R) sub-system in accordance with various embodiments;
FIG. 6 is a schematic block diagram of an embodiment of a parallelized data store, retrieve, and/or process (IO&P) sub-system in accordance with various embodiments;
FIG. 7 is a schematic block diagram of an embodiment of a computing device in accordance with various embodiments;
FIG. 8 is a schematic block diagram of another embodiment of a computing device in accordance with various embodiments;
FIG. 9 is a schematic block diagram of another embodiment of a computing device in accordance with various embodiments;
FIG. 10 is a schematic block diagram of an embodiment of a node of a computing device in accordance with various embodiments;
FIG. 11 is a schematic block diagram of an embodiment of a node of a computing device in accordance with various embodiments;
FIG. 12 is a schematic block diagram of an embodiment of a node of a computing device in accordance with various embodiments;
FIG. 13 is a schematic block diagram of an embodiment of a node of a computing device in accordance with various embodiments;
FIG. 14 is a schematic block diagram of an embodiment of operating systems of a computing device in accordance with various embodiments;
FIGS. 15-23 are schematic block diagrams of an example of processing a table or data set for storage in the database system in accordance with various embodiments;
FIG. 24A is a schematic block diagram of a query execution plan implemented via a plurality of nodes in accordance with various embodiments;
FIGS. 24B-24D are schematic block diagrams of embodiments of a node that implements a query processing module in accordance with various embodiments;
FIG. 24E is a schematic block diagram illustrating a plurality of nodes that communicate via shuffle networks in accordance with various embodiments;
FIG. 24F is a schematic block diagram of a database system communicating with an external requesting entity in accordance with various embodiments;
FIG. 24G is a schematic block diagram of a query processing system in accordance with various embodiments;
FIG. 24H is a schematic block diagram of a query operator execution flow in a