US-12619587-B2 - System for retrieval of large datasets in cloud environments


Abstract

A system and method are provided that store electronic data describing events that have occurred in a computing system, index the electronic data to create indexed data records, and store the indexed data records in computer memory as part of a flat data structure.

Inventors

Jean-Philippe Bergeron
Michael John Cyze

Assignees

MICRO FOCUS LLC

Dates

Publication Date: 2026-05-05
Application Date: 2021-02-09

Claims (17)

1. A method, comprising: storing, by a microprocessor, electronic data describing events that have occurred in a computing system; indexing, by the microprocessor, the electronic data to create indexed data records; storing, by the microprocessor, the indexed data records in computer memory as part of a flat data structure; determining, by the microprocessor, for each electronic data, a number of n-grams that are represented within a first electronic data; assigning, by the microprocessor, a document identifier associated with the first electronic data, wherein document identifiers are unique to a shard corresponding to a subset of the electronic data but not globally unique; correlating, by the microprocessor, the document identifier with each of the number of n-grams that are represented within the first electronic data; and storing by the microprocessor, the document identifier in a delta-compressed list storing a plurality of delta-compressed document identifier values, wherein each stored value of each document identifier in the delta-compressed list represents a numerical difference between sequential document identifiers excluding a first document identifier; and wherein the delta-compressed list is further compressed by storing differences between adjacent delta values.
2. The method of claim 1, further comprising: merging, by the microprocessor, the indexed data records to create merged indexed data records, the indexed data records being stored as the merged indexed data records; discovering, by the microprocessor, a new electronic data describing a new event in the computing system; in response to discovering the new electronic data, indexing, by the microprocessor, the new electronic data to create a new indexed data record; merging, by the microprocessor, the new indexed data record with the merged indexed data records; and storing, by the microprocessor, the new indexed data record in the computer memory as part of the flat data structure.
3. The method of claim 1, further comprising: receiving, by the microprocessor, a database query that comprises a query term; searching, by the microprocessor, the flat data structure for a match between the query term and an n-gram stored in the flat data structure; and returning, by the microprocessor, a document identifier associated with the match between the query term and the n-gram stored in the flat data structure.
4. The method of claim 3, wherein the database query comprises a string query, and wherein the n-gram comprises a trigram.
5. The method of claim 3, wherein the flat data structure is stored in a cloud-computing environment and further comprising: receiving, by the microprocessor, a second database query that comprises a second query term; searching, by the microprocessor, the flat data structure for a match between the second query term and another n-gram stored in the flat data structure, wherein the flat data structure is searched for the second query term in parallel with being searched for the query term; and returning, by the microprocessor, a second document identifier associated with the match between the second query term and the another n-gram stored in the flat data structure.
6. The method of claim 1, wherein the flat data structure correlates n-grams to document identifiers, and wherein the n-grams are encoded as integer values.
7. The method of claim 1, wherein the electronic data comprises data files received from endpoints in the computing system, and wherein the flat data structure is searchable immediately after having the indexed data records stored therein.
8. A database management system, comprising: a network interface to send and receive communications; a microprocessor in communication with the network interface; and a computer readable medium coupled with the microprocessor and comprising one or more sets of instructions that, when executed by the microprocessor, cause the microprocessor to: store electronic data describing events that have occurred in a computing system; index the electronic data to create indexed data records; store the indexed data records in computer memory as part of a flat data structure; determine, for each electronic data, a number of n-grams that are represented within a first electronic data; assign a document identifier associated with the first electronic data, wherein document identifiers are unique to a shard corresponding to a subset of the electronic data but not globally unique; correlate the document identifier with each of the number of n-grams that are represented within the first electronic data; store the document identifier in a delta-compressed list storing a plurality of delta-compressed document identifier values, wherein each stored value of the document identifiers in the delta-compressed list represents a numerical difference between sequential document identifiers excluding a first document identifier; and wherein the delta-compressed list is further compressed by storing differences between adjacent delta values.
9. The database management system of claim 8, wherein the microprocessor, when executing the one or more sets of instructions, further: merges the indexed data records to create merged indexed data records, the indexed data records being stored as the merged indexed data records; discovers a new electronic data describing a new event in the computing system; in response to discovering the new electronic data, indexes the new electronic data to create a new indexed data record; merges the new indexed data record with the merged indexed data records; and stores the new indexed data record in the computer memory as part of the flat data structure.
10. The database management system of claim 8, wherein the microprocessor, when executing the one or more sets of instructions, further: receives a database query that comprises a query term; searches the flat data structure for a match between the query term and an n-gram stored in the flat data structure; and returns a document identifier associated with the match between the query term and the n-gram stored in the flat data structure.
11. The database management system of claim 10, wherein the database query comprises a string query, wherein the n-gram comprises a trigram, wherein the flat data structure is stored in a cloud-computing environment and wherein the microprocessor, when executing the one or more sets of instructions, further: receives a second database query that comprises a second query term; searches the flat data structure for a match between the second query term and another n-gram stored in the flat data structure, wherein the flat data structure is searched for the second query term in parallel with being searched for the query term; and returns a second document identifier associated with the match between the second query term and the another n-gram stored in the flat data structure.
12. The database management system of claim 8, wherein the flat data structure correlates n-grams to document identifiers, and wherein the n-grams are encoded as integer values.
13. The database management system of claim 8, wherein the electronic data comprises data files received from endpoints in the computing system, and wherein the flat data structure is searchable immediately after having the indexed data records stored therein.
14. A database management system, comprising: a network interface to send and receive communications; a microprocessor in communication with the network interface; and a computer readable medium coupled with the microprocessor and comprising one or more sets of instructions that, when executed by the microprocessor, cause the microprocessor to: receive a database search query comprising a query term to search electronic data describing events that have occurred in a computing system; convert the query term into an equivalent n-gram; search indexed data records configured as a flat data structure for electronic data matching the equivalent n-gram; and provide search results to a user, wherein the search results comprise a set of document identifiers, each document identifier being associated with electronic data in a shard, the shard corresponding to a subset of the electronic data, and wherein document identifiers are unique to a shard corresponding to a subset of the electronic data but not globally unique, wherein the document identifiers are stored in a delta-compressed list storing a plurality of delta-compressed document identifier values, wherein each stored value of the document identifiers in the delta-compressed list represents a numerical difference between sequential document identifiers excluding a first document identifier and a last document identifier; and wherein the delta-compressed list is further compressed by storing differences between adjacent delta values.
15. The database management system of claim 14, wherein the microprocessor converts the equivalent n-gram into an equivalent set of integers and uses the equivalent set of integers in the search of the indexed data records.
16. The database management system of claim 14, wherein the query term comprises plural sets of characters corresponding to multiple n-grams, and wherein the matching electronic data provided to the user is a union of matching electronic data corresponding to the multiple n-grams.
17. The database management system of claim 14 wherein each document identifier has a corresponding byte offset of a record in the electronic data.
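
The indexing and compression steps recited in the claims (trigrams encoded as integers, shard-local document identifiers, and a delta-compressed list that is further compressed by storing differences between adjacent delta values) can be illustrated with a short sketch. This is not the patented implementation. The function names, the 3-byte ASCII packing of a trigram, the use of list positions as shard-local document identifiers, and the in-memory dictionary standing in for the flat data structure are all assumptions made for illustration only.

```python
from itertools import accumulate


def trigrams(text: str) -> set[str]:
    """Return the distinct 3-character n-grams (trigrams) in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_to_int(gram: str) -> int:
    """Pack a 3-character ASCII trigram into one integer (assumed encoding)."""
    b = gram.encode("ascii")
    return (b[0] << 16) | (b[1] << 8) | b[2]


def compress_doc_ids(doc_ids: list[int]) -> list[int]:
    """Keep the first ID, then store differences between adjacent deltas."""
    if not doc_ids:
        return []
    # First pass: differences between sequential document IDs.
    deltas = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    # Second pass: keep the first delta, then differences between deltas.
    second = deltas[:1] + [b - a for a, b in zip(deltas, deltas[1:])]
    return [doc_ids[0]] + second


def decompress_doc_ids(stored: list[int]) -> list[int]:
    """Invert compress_doc_ids with two running sums."""
    if not stored:
        return []
    deltas = list(accumulate(stored[1:]))
    return list(accumulate([stored[0]] + deltas))


def build_shard_index(records: list[str]) -> dict[int, list[int]]:
    """Map integer-encoded trigrams to compressed, shard-local document IDs."""
    postings: dict[int, list[int]] = {}
    # Assumption: a record's position in the shard serves as its document ID,
    # so IDs are unique within this shard but not across shards.
    for doc_id, record in enumerate(records):
        for gram in trigrams(record):
            postings.setdefault(trigram_to_int(gram), []).append(doc_id)
    return {key: compress_doc_ids(ids) for key, ids in postings.items()}


def search(index: dict[int, list[int]], term: str) -> set[int]:
    """Return the union of document IDs matching any trigram of the term."""
    hits: set[int] = set()
    for gram in trigrams(term):
        hits.update(decompress_doc_ids(index.get(trigram_to_int(gram), [])))
    return hits
```

Under these assumptions, the identifiers 100, 103, 106, 109, 115 have deltas 3, 3, 3, 6 and are stored as 100, 3, 0, 0, 3. Evenly spaced identifiers therefore collapse into runs of zeros, which is where the second pass is likely to save space. The `search` function takes the union over trigrams, following the wording of claim 16; a stricter matcher could intersect the per-trigram results instead.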

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a national stage application under 35 U.S.C. 371 and claims the benefit of PCT Application No. PCT/US2021/017285 having an international filing date of 9 Feb. 2021, which designated the United States, the entire disclosures of each of which are incorporated herein by reference.

FIELD

The disclosure relates generally to database management and particularly to cloud storage and retrieval of large datasets.

BACKGROUND

Cloud providers, such as Amazon Web Services (AWS)™, Microsoft Azure™, or Google Cloud Platform (GCP)™, provide “serverless” dynamic management and allocation of machine resources, an increasingly important deployment model for the development of cloud-based and software-as-a-service (SaaS) applications. Serverless platforms offer functionality that is not available in traditional deployment platforms, such as unlimited and inexpensive, but high latency, storage systems (e.g., AWS's S3™, GCP's Cloud Storage™, and Azure's Azure Storage™) and highly parallelizable, short-term computing capabilities (e.g., AWS's Lambda™, GCP's Cloud Functions™, and Azure's Azure Functions™).

SUMMARY

These and other needs are addressed by the various embodiments and configurations of the present disclosure.

A method is provided that can include the steps: storing electronic data describing events that have occurred in a computing system; indexing the electronic data to create indexed data records; and storing the indexed data records in computer memory as part of a flat data structure.

A database management system is provided that can include a network interface to send and receive communications, a microprocessor in communication with the network interface, and a computer readable medium coupled with the microprocessor and comprising one or more sets of instructions. When the instructions are executed by the microprocessor, the microprocessor can: store electronic data describing events that have occurred in a computing system; index the electronic data to create indexed data records; and store the indexed data records in computer memory as part of a flat data structure.

A database management system is provided that can include a network interface to send and receive communications, a microprocessor in communication with the network interface, and a computer readable medium coupled with the microprocessor and comprising one or more sets of instructions. When the instructions are executed by the microprocessor, the microprocessor can: receive a database search query comprising a query term that occurs in electronic data describing events that have occurred in a computing system; convert the query term into an equivalent n-gram; search indexed data records configured as a flat data structure for electronic data matching the equivalent n-gram; and provide the matching electronic data to a user.

The preceding is a simplified summary to provide an understanding of some aspects of the disclosure. This summary is neither an extensive nor exhaustive overview of the disclosure and its various embodiments. It is intended neither to identify key or critical elements of the disclosure nor to delineate the scope of the disclosure but to present selected concepts of the disclosure in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the disclosure are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below. Also, while the disclosure is presented in terms of exemplary embodiments, it should be appreciated that individual aspects of the disclosure can be separately claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a cloud-based system according to embodiments of this disclosure;

FIG. 2 is a block diagram of a data management server according to an embodiment of this disclosure;

FIG. 3 depicts a data management and retrieval process flow according to an embodiment of the present disclosure;

FIG. 4 depicts various data structures according to an embodiment of the present disclosure;

FIG. 5 depicts a data management process flow according to an embodiment of the present disclosure; and

FIG. 6 depicts a data retrieval process flow according to an embodiment of the present invention.

DETAILED DESCRIPTION

The system and method of the present disclosure can address a number of technical problems. A common goal for data lakes is to store a massive amount of information at a small cost and allow for quick data retrieval. For cloud-based data lakes, storing massive amounts of data on cloud storage platforms, like S3™, is relatively inexpensive, but due to the high latencies, typical data retrieval times may be too slow for certain purposes, such as highly interactive user interfaces or real-time analytics systems. The typical way to retrieve data from a cloud-based da