US-12621333-B1 - System, method, and computer program for context enrichment of logs in a cybersecurity system
Abstract
The present disclosure describes a system, method, and computer program for enriching logs with cybersecurity threat information. The system receives a first input data stream with logs usable for cybersecurity evaluation and a second input data stream with data values (e.g., IP addresses and user IDs) that are associated with a cybersecurity threat (“threat values”). The system stores the threat values in a datastore as they are received. As logs are received, the system enriches each log having a data value that matches a threat value in the datastore. As compared to conventional systems, the system significantly reduces the number of context queries on the datastore by using a probabilistic filter to determine if there is a zero or non-zero probability of a data value in a log being one of the threat values. If there is a zero probability, the system is able to determine that the log does not have one of the threat values and, thus, can enrich the log based solely on the output of the probabilistic filter.
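The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a Bloom filter as the probabilistic filter (the text above does not name a specific filter), uses an in-process Python set as a stand-in for the datastore, and all names (`BloomFilter`, `Enricher`, `src_ip`, `threat`, `store_queries`) are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives.

    Assumption: the source only says "probabilistic filter"; a Bloom filter
    is one common choice and is used here purely for illustration.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "zero probability"; True means "non-zero probability".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


class Enricher:
    def __init__(self):
        self.threat_store = set()  # hypothetical stand-in for the datastore
        self.bloom = BloomFilter()
        self.store_queries = 0     # counts context queries on the datastore

    def add_threat_value(self, value):
        # Second input stream: store each threat value as it arrives.
        self.threat_store.add(value)
        self.bloom.add(value)

    def enrich(self, log, field="src_ip"):
        # First input stream: enrich a log that has the selected field.
        value = log.get(field)
        if value is None:
            return log
        enriched = dict(log)
        if not self.bloom.might_contain(value):
            # Zero probability: decide from the filter alone, no query.
            enriched["threat"] = False
            return enriched
        # Non-zero probability: confirm against the datastore.
        self.store_queries += 1
        enriched["threat"] = value in self.threat_store
        return enriched
```

In this sketch the datastore is queried only when the filter cannot rule a value out, so values the filter rejects never reach the `store_queries` counter.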
Inventors
- Dinesh Maheshwari
- Kenshin Sakura
Assignees
- Exabeam, Inc.
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-04-27
Claims (20)
- 1. A non-transitory computer-readable medium comprising a computer program, that, when executed by a computer system, enables the computer system to perform the following method for enriching logs with cybersecurity threat information, the method comprising: receiving a first input data stream and a second input data stream, wherein the first input data stream includes logs usable for cybersecurity evaluation and the second input data stream includes data values that are associated with a cybersecurity threat; storing the data values associated with the cybersecurity threat in a first datastore; for each log in the first input data stream having a select data field, processing the log by performing the following: identifying a data value in the select data field; using a probabilistic filter to determine whether there is a zero probability of the first datastore having a data value matching the data value in the select data field; in response to there being a zero probability of the first datastore having the matching data value, determining that there is no indication in the first datastore that the data value in the select data field is associated with a cybersecurity threat; in response to there being a non-zero probability of the first datastore having the matching data value, searching the first datastore for the matching data value and, in response to finding the matching data value in the first datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat; and outputting the log into an output data stream.
- 2. The non-transitory computer-readable medium of claim 1, further comprising: in response to there being a zero probability of the first datastore having the matching data value, enriching the log to indicate that the data value in the select data field is not associated with a cybersecurity threat.
- 3. The non-transitory computer-readable medium of claim 1, wherein the logs relate to one or more computer-based systems being monitored by a cybersecurity system.
- 4. The non-transitory computer-readable medium of claim 1, wherein the data values in the first datastore are IP addresses associated with one or more of the following: ransomware, phishing, malware, and a trojan attack.
- 5. The non-transitory computer-readable medium of claim 1, wherein the data values in the first datastore are user IDs associated with one or more of the following: a compromised user, a malicious user, a watched user, and a high value user.
- 6. The non-transitory computer-readable medium of claim 1, wherein a subset of the data values in the second input data stream are stored in a second datastore and wherein prior to the step of using the probabilistic filter, the method further comprises: for each log having the select data field, searching the second datastore for a data value matching the data value in the select data field; in response to finding the matching data value in the second datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat, bypassing the probabilistic filter and the first datastore, and outputting the enriched log in the output data stream; and in response to not finding the matching data value in the second datastore, proceeding with the step of using the probabilistic filter.
- 7. The non-transitory computer-readable medium of claim 6, wherein the first datastore is a distributed datastore and the second datastore is a local datastore.
- 8. The non-transitory computer-readable medium of claim 7, wherein the first datastore is a distributed in-memory cache and the second datastore is a local cache running on the system on which the logs are received.
- 9. The non-transitory computer-readable medium of claim 7, wherein, in response to searching and finding the matching data value in the distributed in-memory cache, adding the matching data value to the local cache.
- 10. A computer system for enriching logs with cybersecurity threat information, the system comprising: one or more processors; one or more memory units coupled to the one or more processors, wherein the one or more memory units store instructions that, when executed by the one or more processors, cause the system to perform the operations of: receiving a first input data stream and a second input data stream, wherein the first input data stream includes logs usable for cybersecurity evaluation and the second input data stream includes data values that are associated with a cybersecurity threat; storing the data values associated with the cybersecurity threat in a first datastore; for each log in the first input data stream having a select data field, processing the log by performing the following: identifying a data value in the select data field; using a probabilistic filter to determine whether there is a zero probability of the first datastore having a data value matching the data value in the select data field; in response to there being a zero probability of the first datastore having the matching data value, determining that there is no indication in the first datastore that the data value in the select data field is associated with a cybersecurity threat; in response to there being a non-zero probability of the first datastore having the matching data value, searching the first datastore for the matching data value and, in response to finding the matching data value in the first datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat; and outputting the log into an output data stream.
- 11. The system of claim 10, further comprising: in response to there being a zero probability of the first datastore having the matching data value, enriching the log to indicate that the data value in the select data field is not associated with a cybersecurity threat.
- 12. The system of claim 10, wherein the data values in the first datastore are IP addresses associated with one or more of the following: ransomware, phishing, malware, and a trojan attack.
- 13. The system of claim 10, wherein the data values in the first datastore are user IDs associated with one or more of the following: a compromised user, a malicious user, a watched user, and a high value user.
- 14. The system of claim 10, wherein a subset of the data values in the second input data stream are stored in a second datastore and wherein prior to the step of using the probabilistic filter, the method further comprises: for each log having the select data field, searching the second datastore for a data value matching the data value in the select data field; in response to finding the matching data value in the second datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat, bypassing the probabilistic filter and the first datastore, and outputting the enriched log in the output data stream; and in response to not finding the matching data value in the second datastore, proceeding with the step of using the probabilistic filter.
- 15. The system of claim 14, wherein the first datastore is a distributed in-memory cache and the second datastore is a local cache running on the computer on which the logs are received.
- 16. A method, performed by a computer-based cybersecurity system, for enriching logs with cybersecurity threat information, the method comprising: receiving a first input data stream and a second input data stream, wherein the first input data stream includes logs usable for cybersecurity evaluation and the second input data stream includes data values that are associated with a cybersecurity threat; storing the data values associated with the cybersecurity threat in a first datastore; for each log in the first input data stream having a select data field, processing the log by performing the following: identifying a data value in the select data field; using a probabilistic filter to determine whether there is a zero probability of the first datastore having a data value matching the data value in the select data field; in response to there being a zero probability of the first datastore having the matching data value, determining that there is no indication in the first datastore that the data value in the select data field is associated with a cybersecurity threat; in response to there being a non-zero probability of the first datastore having the matching data value, searching the first datastore for the matching data value and, in response to finding the matching data value in the first datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat; and outputting the enriched log into an output data stream.
- 17. The method of claim 16 further comprising: in response to there being a zero probability of the first datastore having the matching data value, enriching the log to indicate that the data value in the select data field is not associated with a cybersecurity threat.
- 18. The method of claim 16, wherein the data values in the first datastore are IP addresses associated with one or more of the following: ransomware, phishing, malware, and a trojan attack.
- 19. The method of claim 16, wherein the data values in the first datastore are user IDs associated with one or more of the following: a compromised user, a malicious user, a watched user, and a high value user.
- 20. The method of claim 16, wherein a subset of the data values in the second input data stream are stored in a second datastore and wherein prior to the step of using the probabilistic filter, the method further comprises: for each log having the select data field, searching the second datastore for a data value matching the data value in the select data field; in response to finding the matching data value in the second datastore, enriching the log to indicate that the data value in the select data field is associated with a cybersecurity threat, bypassing the probabilistic filter and the first datastore, and outputting the enriched log in the output data stream; and in response to not finding the matching data value in the second datastore, proceeding with the step of using the probabilistic filter.
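The two-tier lookup recited in claims 6 through 9 (local datastore first, then the probabilistic filter, then the distributed datastore, with confirmed matches copied into the local cache) might look roughly like the sketch below. Everything here is illustrative and assumed rather than taken from the claims: the LRU eviction policy, the `LocalCache` and `enrich_two_tier` names, the `threat` field, the `stats` counters, and the use of a plain callable `might_contain` and a Python container in place of a real filter and a real distributed cache.

```python
from collections import OrderedDict


class LocalCache:
    """Small in-process cache standing in for the 'second datastore'.

    Assumption: LRU eviction is used here for illustration; the claims do
    not specify an eviction policy.
    """

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._items = OrderedDict()

    def __contains__(self, value):
        if value in self._items:
            self._items.move_to_end(value)  # mark as most recently used
            return True
        return False

    def add(self, value):
        self._items[value] = True
        self._items.move_to_end(value)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def _bump(stats, key):
    stats[key] = stats.get(key, 0) + 1


def enrich_two_tier(log, field, local_cache, might_contain,
                    distributed_store, stats):
    """Enrich one log; `might_contain` plays the probabilistic filter."""
    value = log.get(field)
    if value is None:
        return log
    enriched = dict(log)

    # Tier 1: a local hit bypasses both the filter and the remote store.
    if value in local_cache:
        _bump(stats, "local_hits")
        enriched["threat"] = True
        return enriched

    # Filter: a negative answer means the remote store cannot hold the value.
    if not might_contain(value):
        _bump(stats, "filter_rejects")
        enriched["threat"] = False
        return enriched

    # Tier 2: possible match, so confirm against the distributed store.
    _bump(stats, "remote_queries")
    found = value in distributed_store
    if found:
        local_cache.add(value)  # later logs with this value stay local
    enriched["threat"] = found
    return enriched
```

With this arrangement, a repeated threat value costs one remote query the first time it is seen and none afterwards while it remains in the local cache; a filter false positive still costs a remote query but yields a log marked as not a threat.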
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to cybersecurity systems, and, more specifically, to enriching logs with cybersecurity threat information.
2. Description of the Background Art
Cybersecurity systems monitor entity behavior in a network in order to detect cybersecurity threats. An entity may be a user or a machine. As entities interact with the network, various systems generate raw logs related to the entity behavior. For example, a cybersecurity system may obtain raw data logs related to a user's interactions with the IT infrastructure, such as user logon events, server access events, application access events, and data access events. The raw data logs may be obtained from third-party systems. A cybersecurity system will typically generate event logs from these raw data logs.
For entity behavior modeling in IT network security analytics, it is critical to leverage contextual information to improve alert accuracy. For example, contextual information can be used to construct and evaluate context-specific rules. As a result, event logs are often supplemented with additional data that provides further context for the entity events. For example, if an event relates to an IP address associated with prior cybersecurity threats, this is important context information for evaluating current threats. The supplementing of logs with additional context information is known as “context enrichment.”
FIG. 1 illustrates the conventional method for context enrichment in substantially real time. A context-enrichment module 110 receives a data stream of logs (i.e., log pipeline 120) and a data stream of context data (i.e., data pipeline 130). The goal is to join the logs with the context data as the logs are received in real time. To do so, local in-memory databases 140 are created for the logs, and local in-memory databases 150 are created for the context data.
The context-enrichment module performs “map/shuffle/reduce” operations on the in-memory databases to obtain cross-joined data 160. Local in-memory databases (as opposed to remote, distributed caches) are required with this technique in order to provide substantially real-time context enrichment. The module outputs the enriched logs into downstream pipeline 170. This process consumes a huge amount of computing and memory resources. For example, processing 100 million event logs per second using this conventional method would require several hundred gigabytes of RAM.
As cybersecurity becomes increasingly important in this digital age, large, distributed cybersecurity systems will likely need to scale beyond enriching 100 million logs per second per region. To achieve this, it is essential to perform context enrichment in a way that minimizes the impact on log pipeline throughput. As such, context queries need to complete with low latency (e.g., less than 1 millisecond per query) and support high concurrency. Such low-latency requirements call for in-memory storage techniques, as going to disk to fetch context data is too slow for such use cases.
Therefore, there is demand for a context-enrichment method that satisfies the low-latency requirements needed to significantly scale real-time context enrichment capabilities while at the same time using significantly fewer computing and memory resources than known methods.
SUMMARY OF THE DISCLOSURE
The present disclosure describes a system, method, and computer program for enriching logs with cybersecurity threat information. The system receives a first input data stream with logs usable for cybersecurity evaluation and a second input data stream with data values (e.g., IP addresses and user IDs) that are associated with a cybersecurity threat (“threat values”). The system stores the threat values in a first datastore as they are received.
As the logs are received, the system enriches each log having a data value that matches a threat value in the datastore to indicate that the data value is associated with a cybersecurity threat. As compared to conventional systems, the system significantly reduces the number of context queries on the datastore by using a probabilistic filter to determine if there is a zero or non-zero probability of a log having one of the threat values. If there is a zero probability of a log having one of the threat values, the system enriches the log based solely on the output of the probabilistic filter. If there is a non-zero probability of the log having one of the threat values, the system queries the first datastore to see if there is a threat value matching a data value in the log.
As compared to conventional context-enrichment techniques, the novel method disclosed herein significantly reduces the number of context queries on a database of threat values. For example, it is typical for about 1% of IP addresses in logs processed by a cybersecurity system to be IP addresses associated with a known threat. In such case, the