US-12619610-B2 - Estimating cost of query execution on a set of data accessible to a computing system
Abstract
Estimating a cost of executing a query on a set of data involves executing logic to: estimate a size of each datum in the set of data; receive a query specifying a value for a first datum and a plurality of additional datum in the set of data associated with the first datum, and a maximum number of first datum to be retrieved from the set of data that have the specified value; estimate a cost of executing the query based on the maximum number of first datum to be retrieved from the set of data that have the specified value, the plurality of additional datum associated with the first datum, and the estimated size of the first datum and each of the additional datum associated with the first datum; and execute the query on the set of data responsive to the estimated cost.
Inventors
- Hazim Avdal
Assignees
- CROWDSTRIKE, INC.
Dates
- Publication Date
- 20260505
- Application Date
- 20240926
Claims (20)
- 1 . A computer-implemented method performed by a system having at least a processor and a memory therein to execute instructions for estimating a cost of querying a set of data maintained in an electronic data store accessible to the system, the set of data having a varying plurality of records each comprising a plurality of datum at least one of which varies in size, the method comprising: executing logic to continually or repeatedly estimate a current shape of the set of data maintained in the electronic data store, including executing logic to estimate a current size of each of the plurality of datum in the plurality of records in the set of data; receiving via an interface of the system a query specifying: a value for a first datum in the plurality of records in the set of data, a selection of an additional one or more of the plurality of datum in the plurality of records in the set of data, and a maximum number of first datum to be retrieved from the plurality of records in the set of data that have a value that matches the specified value for the first datum; executing logic to estimate a cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set the data; and executing the query on the set of data responsive to the estimated cost of executing the query.
- 2 . The computer-implemented method of claim 1 , wherein executing logic to estimate the current size of each of the plurality of datum in the plurality of records in the set of data comprises executing logic to estimate the current size of each of the plurality of datum across a plurality of sample records in the set of data.
- 3 . The computer-implemented method of claim 2 , wherein executing logic to estimate the current size of each of the plurality of datum across the plurality of sample records in the set of data comprises executing logic to estimate the current size of each of the plurality of datum across a plurality of uniformly distributed and randomly chosen sample records in the set of data.
- 4 . The computer-implemented method of claim 2 , wherein executing logic to estimate the current size of each of the plurality of datum across the plurality of sample records in the set of data comprises executing logic to estimate one of a mean, a median, and a mode, for the current size of at least one datum across the plurality of sample records in the set of data.
- 5 . The computer-implemented method of claim 1 , wherein executing logic to estimate the cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set of data, comprises executing logic to estimate a number of bytes to be fetched from the electronic data store to execute the query on the set of data.
- 6 . The computer-implemented method of claim 2 , wherein executing logic to estimate the cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set of data, comprises: executing logic to estimate a current size of the first datum and each of the selected additional one or more of the plurality of datum based on the estimated current size of each of the plurality of datum across the plurality of sample records in the set of data; executing logic to sum the estimated current size of the first datum and the estimated current size of each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data; and executing logic to multiply the summed estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data by the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum.
- 7 . The computer-implemented method of claim 1 , further comprising executing logic to compare the estimated cost of executing the query on the set of data to a selected limit for a cost of executing a query on the set of data; and wherein executing the query on the set of data comprises executing the query on the set of data where the estimated cost of executing the query on the set of data is less than the selected limit for the cost of executing the query on the set of data.
- 8 . The computer-implemented method of claim 7 , wherein executing the query on the set of data comprises, where the estimated cost of executing the query on the set of data is greater than the selected limit for the cost of executing the query on the set of data, one or more of: transmitting a notification via the interface of the system that the estimated cost of executing the query on the set of data exceeds the selected limit for the cost of executing the query on the set of data; delaying executing the query on the set of data for a period of time; scheduling executing the query on the set of data according to a limit on a number of queries that can be executed against the set of data within a specified time period; and transmitting a request for input to authorize or pay a fee to execute the query on the set of data.
- 9 . The computer-implemented method of claim 1 wherein executing the query on the set of data comprises executing the query on the set of data responsive to the estimated cost of executing the query and one or more of: a throttling algorithm that controls when to execute the query against the set of data, a rate-limiting algorithm that limits a number of queries that can be executed against the set of data within a specific time period, and a Cost Of Goods Sold (COGS) bounding algorithm that tracks a cost incurred to execute the query against the set of data.
- 10 . The computer-implemented method of claim 1 , wherein executing logic to estimate the cost of executing the query comprises executing logic to estimate at least one of a computational cost to the system of executing the query and a dollar cost to a user or organization for executing the query.
- 11 . A computer system, comprising: one or more processors; a memory to store computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to execute instructions for estimating a cost of querying a set of data maintained in an electronic data store accessible to the system, the set of data having a varying plurality of records each comprising a plurality of datum at least one or which varies in size, the method comprising: executing logic to continually or repeatedly estimate a current shape of the set of data maintained in the electronic data store, including executing logic to estimate a current size of each of the plurality of datum in the plurality of records in the set of data; receiving via an interface of the system a query specifying: a value for a first datum in the plurality of records in the set of data, a selection of an additional one or more of the plurality of datum in the plurality of records in the set of data, and a maximum number of first datum to be retrieved from the plurality of records in the set of data that have a value that matches the specified value for the first datum; executing logic to estimate a cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set of data; and executing the query on the set of data responsive to the estimated cost of executing the query.
- 12 . The computer system of claim 11 , wherein executing logic to estimate the current size of each of the plurality of datum in the plurality of records in the set of data comprises executing logic to estimate the current size of each of the plurality of datum across a plurality of sample records in the set of data.
- 13 . The computer system of claim 12 , wherein executing logic to estimate the current size of each of the plurality of datum across the plurality of sample records in the set of data comprises executing logic to estimate the current size of each of the plurality of datum across a plurality of uniformly distributed and randomly chosen sample records in the set of data.
- 14 . The computer system of claim 12 , wherein executing logic to estimate the current size of each of the plurality of datum across the plurality of sample records in the set of data comprises executing logic to estimate one of a mean, a median, and a mode, for the current size of at least one datum across the plurality of sample records in the set of data.
- 15 . The computer system of claim 11 , wherein executing logic to estimate the cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum In the plurality of records in the set of data obtained from the estimated current shape of the set of data, comprises executing logic to estimate a number of bytes to be fetched from the electronic data store to execute the query on the set of data.
- 16 . The computer system of claim 12 , wherein executing logic to estimate the cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set of data, comprises: executing logic to estimate a current size of the first datum and each of the selected additional one or more of the plurality of datum based on the estimated current size of each of the plurality of datum across the plurality of sample records in the set of data; executing logic to sum the estimated current size of the first datum and the estimated current size of each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data; and executing logic to multiply the summed estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data by the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum.
- 17 . The computer system of claim 11 , further comprising executing logic to compare the estimated cost of executing the query on the set of data to a selected limit for a cost of executing a query on the set of data; and wherein executing the query on the set of data comprises executing the query on the set of data responsive to comparing the estimated cost of executing the query on the set of data to the selected limit for the cost of executing the query on the set of data.
- 18 . The computer system of claim 11 , wherein executing the query on the set of data comprises executing the query on the set of data responsive to the estimated cost of executing the query and one or more of: a throttling algorithm that controls when to execute the query against the set of data, a rate-limiting algorithm that limits a number of queries that can be executed against the set of data within a specific time period, and a Cost Of Goods Sold (COGS) bounding algorithm that tracks a cost incurred to execute the query against the set of data.
- 19 . The computer system of claim 11 , wherein executing logic to estimate the cost of executing the query comprises executing logic to estimate at least one of a computational cost to the system of executing the query and a dollar cost to a user or organization for executing the query.
- 20 . A non-transitory computer-readable medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to execute instructions for estimating a cost of querying a set of data maintained in an electronic data store accessible to the system, the set of data having a varying plurality of records each comprising a plurality of datum at least one of which varies in size, the method comprising: executing logic to continually or repeatedly estimate a current shape of the set of data maintained in the electronic data store, including executing logic to estimate a current size of each of the plurality of datum in the plurality of records in the set of data; receiving via an interface of the system a query specifying: a value for a first datum in the plurality of records in the set of data, a selection of an additional one or more of the plurality of datum in the plurality of records in the set of data, and a maximum number of first datum to be retrieved from the plurality of records in the set of data that have a value that matches the specified value for the first datum; executing logic to estimate a cost of executing the query based on the maximum number of first datum to be retrieved from the plurality of records in the set of data that have the value that matches the specified value for the first datum, the selected additional one or more of the plurality of datum in the plurality of records in the set of data, and the estimated current size of the first datum and each of the selected additional one or more of the plurality of datum in the plurality of records in the set of data obtained from the estimated current shape of the set of data; and executing the query on the set of data responsive to the estimated cost of executing the query.
Description
TECHNICAL FIELD Embodiments relate to digital computing systems, particularly to estimating a size of a query on a set of data and a cost of executing the query on the set of data based on the estimated query size. BACKGROUND Computing systems that receive complex and/or adjustable queries, such as user queries directed to a database system, tend to have a variable resource utilization schedule. In other words, the amount of computing resources, hardware, databases, servers, memory constraints, network resources, etc., in use at a given point in time may vary according to time of day and/or other actions or applications using the computing system resources. Estimating the computational cost of such queries has many useful applications such as enforcing query throttling (i.e., controlling the number of queries the computer system will accept in a given period of time to prevent overwhelming or overloading the system), query rate-limiting (i.e., limiting the number of queries from a particular user or client over a given period of time), and Cost-Of-Goods-Sold (COGs) bounding (i.e., limiting the number of queries from a particular user or client based on the computational and/or financial costs associated with providing responses to the user/client queries), among other practical considerations. Rate limiting and throttling are well-known bounding techniques in software development. Query size estimation is also a common practice in some database systems, such as MySQL. Typically, these practices use static bounding or rely on the shape of the query itself to perform query size estimations. BRIEF DESCRIPTION OF THE DRAWINGS The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features. FIG. 1 is a flowchart 100 describing aspects of embodiments of the invention. FIG. 2 is a flowchart 102 describing aspects of embodiments of the invention. FIG. 3 is a flowchart 106 describing aspects of embodiments of the invention. FIG. 4 is a flowchart 400 describing aspects of embodiments of the invention. FIG. 5 illustrates an example architecture for a computing device capable of carrying out an embodiment of the invention. DETAILED DESCRIPTION Embodiments estimate query size by first performing an empirical analysis of the shape of the data in a data store (rather than only analyzing the shape of the query submitted to the data store), and then estimating the size of the query, for example, by estimating the number of bytes expected to be fetched from the data store in response to the query, based on the empirical analysis of the shape of the data. The empirical analysis may be performed repeatedly or continuously so that the analysis is current and up to date. In short, the disclosed embodiments involve dynamically estimating a cost of executing a query on a set of data, e.g., the disclosed embodiments involve estimating the cost to fetch the number of bytes expected to be fetched from the data store in response to the query, given the recent or current shape of the data in the data store. One example context or application for the disclosed embodiments is cybersecurity. Embodiments can be applied to a threat detection storage system which responds to user queries by returning a varying set of “facets” per threat detection query, e.g., a facet of a threat detection document. In such an application, clients provide a predicate that will select a subset of the threat detection data, a set of facets to return per match, as well as the number of matches to return per request. As such, the cost of each query varies significantly based on user-controlled parameters. Predicting the cost of a particular query is not only important for efficient resource allocation, but neglecting or ignoring the cost of the query makes such a service vulnerable to Denial of Service (DoS) attacks. While this can be mitigated by setting static upper bounds on user-provided inputs, a dynamic model ensures higher resource utilization compared to a static model. Of course, embodiments of the invention are not limited to the aforesaid example application. As disclosed in further detail below, embodiments execute logic to: estimate a size of each datum in the set of data; receive a query specifying a value for a first datum and a plurality of additional datum in the set of data associated with the first datum, and a maximum number of first datum to be retrieved from the set of data that have the specified value; estimate a cost of executing the query based on the maximum number of first datum to be retrieved from the set of data that have the specified value, the plurality of additional datum associated with the first datum to be retrieved, and the estimated size of the first datum and each of the ad