US-12619591-B2 - Compact probabilistic data structure for storing log data
Abstract
A computer-implemented method is presented for storing log data generated in a distributed computing environment. The method includes: receiving a log line from a given storage entity; applying a first tokenization rule to create a plurality of base tokens; applying a second tokenization rule to create a plurality of combination tokens; applying a third tokenization rule to create a plurality of n-gram tokens; combining the plurality of tokens into a set of tokens; for each token in the set of tokens, storing a given token by applying a hash function to the given token to generate a hash value, where the given token is associated with the given storage entity containing the log line; updating a listing of entities with the given storage entity, where entries in the listing of entities are configured to identify more than one storage entity and each entry in the listing of entities specifies a unique set of storage entities; and storing the hash value, along with an address, in a token map table of a probabilistic data structure, where the address maps the hash value to an entry in the listing of entities.
Inventors
- Julian REICHINGER
- Renee TRISBERG
Assignees
- DYNATRACE LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-07-10
Claims (10)
- 1. A computer-implemented method for storing log data generated in a distributed computing environment, comprising: receiving a log line from a given storage entity; applying a first tokenization rule to the log line to create a plurality of base tokens, where each base token is a sequence of successive characters in the log line having same type; applying a second tokenization rule to the log line to create a plurality of combination tokens, where each combination token is comprised of two or more base tokens appended together; applying a third tokenization rule to the log line to create a plurality of n-gram tokens, where each n-gram token is an n-gram derived from a base token in the plurality of base tokens; combining tokens from the plurality of base tokens, the plurality of combination tokens and the plurality of n-gram tokens to form a set of tokens; for each token in the set of tokens, storing a given token by applying a hash function to the given token to generate a hash value, where the given token is associated with the given storage entity containing the log line; updating a listing of storage entities with the given storage entity, where entries in the listing of storage entities are configured to identify more than one storage entity where the log line was stored and each entry in the listing of storage entities specifies a unique set of storage entities; and storing the hash value, along with an address, in a token map table of a probabilistic data structure, where the address maps the hash value to an entry in the listing of storage entities.
- 2. The method of claim 1 where the first tokenization rule defines a base token as one of a sequence of successive alphanumeric characters in the log line; a sequence of successive separator characters in the log line; or a sequence of successive characters outside of 7-bit ASCII space.
- 3. The method of claim 1 wherein the second tokenization rule defines a combination token as two base tokens comprised of alphanumeric characters and connected by a separator character.
- 4. The method of claim 1 wherein the second tokenization rule defines a combination token as three base tokens comprised of alphanumeric characters connected by a dot.
- 5. The method of claim 1 wherein each entry in the listing of storage entities further includes a counter, a list of storage entities and a hash value for the list of storage entities, where the counter indicates the number of entries in the probabilistic data structure mapped to a given entry in the listing of storage entities.
- 6. The method of claim 5 wherein updating the listing of storage entities with the given storage entity further comprises determining whether the hash value is stored in the probabilistic data structure; retrieving an entry in the listing of storage entities corresponding to the hash value, where the entry is retrieved in response to a determination that the hash value is stored in the probabilistic data structure; from the retrieved entry, determining whether the given storage entity is in the list of storage entries; incrementing the counter by one in the retrieved entry in response to a determination that the given storage entity is contained in the list of storage entities; and creating a new entry in the listing of storage entities and setting the counter in the new entry to one in response to a determination that the given storage entity is absent from the list of storage entities.
- 7. The method of claim 6 wherein creating a new entry in the listing of storage entities includes generating a hash value for the list of storage entities using a commutative hash function.
- 8. The method of claim 6 wherein updating the listing of storage entities with the given storage entity further comprises adding an entry into the probabilistic data structure for the hash value, where the entry is added in response to a determination that the hash value is absent from the probabilistic data structure.
- 9. The method of claim 5 further comprises creating a lookup table having an index, where each entry in the lookup table includes a hash value for the list of storage entities and each entry corresponds to an entry in the listing of storage entities.
- 10. The method of claim 9 wherein each entry in the lookup table includes an address, where the address maps the hash value of the entry to an entry in the listing of storage entities.
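The three tokenization rules recited in claims 1 through 4 can be illustrated with a short sketch. This is not the patent's implementation; the Python code, the function names, the trigram length n=3, and the decision to keep the joining separator inside a combination token are all illustrative assumptions.

```python
import re

# First rule (claims 1-2): a base token is a maximal run of characters of the
# same type -- alphanumeric ASCII, characters outside 7-bit ASCII, or ASCII
# separator characters (everything else in the 7-bit range).
_BASE = re.compile(
    r'[A-Za-z0-9]+'                            # alphanumeric run
    r'|[^\x00-\x7f]+'                          # run outside 7-bit ASCII
    r'|[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+'  # separator run
)
_ALNUM = re.compile(r'[A-Za-z0-9]+')

def base_tokens(log_line):
    return _BASE.findall(log_line)

# Second rule (claim 3): two alphanumeric base tokens connected by a single
# separator character, appended together (separator retained here by assumption).
def combination_tokens(tokens):
    out = []
    for a, sep, b in zip(tokens, tokens[1:], tokens[2:]):
        if _ALNUM.fullmatch(a) and len(sep) == 1 \
                and not _ALNUM.fullmatch(sep) and _ALNUM.fullmatch(b):
            out.append(a + sep + b)
    return out

# Third rule (claim 1): n-grams derived from a base token.
def ngrams(token, n=3):
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# Combine all three rules into the set of tokens that gets hashed and stored.
def token_set(log_line):
    base = base_tokens(log_line)
    toks = set(base) | set(combination_tokens(base))
    for t in base:
        toks |= set(ngrams(t))
    return toks

base_tokens("user-id=42")  # → ['user', '-', 'id', '=', '42']
```

Note that claim 4 (three dot-connected alphanumeric tokens, e.g. the segments of an IP address or domain name) would be a straightforward extension of `combination_tokens` to windows of five base tokens.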
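Claims 5 through 10 describe per-entry counters, a commutative hash over each set of storage entities, and a lookup table that deduplicates entity sets. A loose sketch of that bookkeeping follows; the blake2b token hash, the XOR-based commutative set hash, and the in-memory dict/list layout are assumptions for illustration, not details taken from the patent.

```python
import hashlib

def h64(text):
    """Illustrative 64-bit hash of a token (assumption: blake2b truncated to 8 bytes)."""
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), 'big')

def set_hash(entities):
    """Commutative hash (claim 7): XOR of per-entity hashes is order-independent."""
    out = 0
    for e in entities:
        out ^= h64(str(e))
    return out

class TokenIndex:
    def __init__(self):
        self.token_map = {}  # token hash -> address into the listing (claim 1)
        self.listing = []    # entries: [counter, entity set, set hash] (claim 5)
        self.lookup = {}     # set hash -> address, deduplicates entity sets (claims 9-10)

    def _entry_for(self, entities):
        sh = set_hash(entities)
        if sh in self.lookup:                 # reuse the entry for this unique set
            return self.lookup[sh]
        self.listing.append([0, frozenset(entities), sh])
        addr = len(self.listing) - 1
        self.lookup[sh] = addr
        return addr

    def add(self, token, entity_id):
        th = h64(token)
        if th not in self.token_map:          # claim 8: hash absent -> new mapping
            addr = self._entry_for({entity_id})
            self.listing[addr][0] += 1
            self.token_map[th] = addr
            return
        addr = self.token_map[th]             # claim 6: hash present -> retrieve entry
        entities = self.listing[addr][1]
        if entity_id in entities:             # entity already recorded for this token
            return
        self.listing[addr][0] -= 1            # move token to the enlarged entity set
        new_addr = self._entry_for(entities | {entity_id})
        self.listing[new_addr][0] += 1
        self.token_map[th] = new_addr
```

Because distinct tokens stored in the same set of entities share a single listing entry, the structure stays compact even when many tokens map to the same batches; the counter tracks how many token-map entries point at each listing entry.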
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. application Ser. No. 18/383,031, filed on Oct. 23, 2023, which is a continuation-in-part of U.S. application Ser. No. 18/119,331, filed on Mar. 9, 2023. These applications also claim the benefit of U.S. Provisional Application No. 63/437,865, filed on Jan. 9, 2023. The entire disclosure of the above applications is incorporated herein by reference.

FIELD

The disclosure concerns a computer-implemented method for storing log data generated in a distributed computing environment.

BACKGROUND

Operating systems for clients and servers, applications (locally installed, web-based, and hybrid), networks including the cloud, containers (such as Docker or Kubernetes), and the like generate log messages during their operation. In the following, the terms log message, log line, and simply log are used synonymously, as all of these terms appear frequently in the art. Each log line is associated with an identity (ID for short), i.e. the ID (typically a number) of an entity, e.g. a specific storage location or, more specifically, a batch or bucket of data containing the log line. Logs can be stored in compressed or uncompressed form.

Generally, logs can be stored and analyzed for many different purposes. In the field of application/computer monitoring, log lines are used to detect anomalies occurring during the operation of the computer system. Since an anomaly is often not detected in real time, log lines are typically stored in a database (DB for short). The DB containing the log lines can then be queried/searched for one or more keywords in order to identify the root cause of the anomaly. The DB can be a local DB, e.g. a DB stored on a server in the local network; a hybrid DB, e.g. where the DB index is stored locally and the log lines are stored in the cloud; or a cloud DB, where both the index and the data are stored in a cloud network.
In response to a database query, the ID of a storage location is returned to the query consumer, indicating where, i.e. in which storage entity, a log line comprising the keyword is stored. Typically, the storage location or data batch containing matching logs is downloaded for inspection. In many cases, the keyword, e.g. an IP address, is present in multiple locations/batches. In a subsequent analysis step, the root cause of the problem can be identified by examining the occurrence of log lines over time.

Many database management systems (DBMS for short) exist which allow the storage and indexing of log lines in a database. Working without an index is currently not an option, since it would take 71,000 CPU cores approximately 10 s to search and find an IP address in 700 TB of log data; this is unacceptable both in terms of required CPU power and time. On the other hand, DBMS exist that can handle big data and perform queries within a reasonable period of time. One such solution is the Elastic/Lucene DBMS, which is widely used in the industry. However, even established solutions have limitations, as indexing is a tradeoff between query performance and storage costs. In some cases, a full index built by an Elastic/Lucene DBMS may be larger than the compressed form of the actual data from which the index is constructed. This is an issue particularly when massive amounts of data/log lines are stored over longer periods of time. It is noted that although computer systems become ever more powerful, the rate of data ingest exceeds the performance gains of computers by far.

Consequently, there is a need to make storing log lines in a data structure more efficient, the data structure more compact, and querying the data structure both faster and more flexible than in prior-art systems. This section provides background information related to the present disclosure which is not necessarily prior art.
SUMMARY

This section provides a general summary of the disclosure and is not a comprehensive disclosure of its full scope or all of its features.

According to a first aspect of the disclosure, the technical problem is solved by a computer-implemented method for storing log data generated in a distributed computing environment, comprising: receiving a log line from a given storage entity; applying a first tokenization rule to the log line to create a plurality of base tokens, where each base token is a sequence of successive characters in the log line having same type; applying a second tokenization rule to the log line to create a plurality of combination tokens, where each combination token is comprised of two or more base tokens appended together; applying a third tokenization rule to the log line to create a plurality of n-gram tokens, where each n-gram token is an n-gram derived from a base token in the plurality of base tokens; combining tokens from the plurality of base tokens, the plurality of combination tokens and the plurality of n-gram tokens to form a set of tokens; for each token in the set of to