US-12621330-B2 - Systems and methods for automated filter recommendation for sigma rules

US12621330B2

Abstract

A system and method for automated filter determination for sigma rules used in a malicious content detection system are provided. The system and method can include training a machine learning algorithm based on candidate sets determined from a plurality of sigma rules, and performing inference on a plurality of candidate sets corresponding to a plurality of sigma rules using the machine learning algorithm to detect malicious content.

Inventors

  • Swapna Sourav Rout
  • Aman Gupta
  • Ravi Kumar Raju Gottumukkala
  • Greg Croxford

Assignees

  • MORGAN STANLEY SERVICES GROUP INC.

Dates

Publication Date
2026-05-05
Application Date
2024-07-11

Claims (16)

  1. A computerized method for automated filter determination for sigma rules used in a malicious content detection system, the computerized method comprising: receiving, by a computing device, a first plurality of sigma rules, each sigma rule in the first plurality of sigma rules comprising at least one or more first tags, one or more first log sources and one or more first selections, and wherein at least some of the sigma rules in the first plurality of sigma rules comprise one or more first filters; for each rule in the first plurality of sigma rules, determining, by the computing device, a first plurality of candidate sets by sampling the one or more first tags, the one or more first log sources, the one or more first selections, and the one or more first filters when present in the respective rule to create all possible combinations to include in the first plurality of candidate sets from the respective rule; training, by the computing device, a machine learning algorithm based on all of the first plurality of candidate sets determined for all of the first plurality of sigma rules; receiving, by the computing device, a second plurality of sigma rules, each sigma rule in the second plurality of sigma rules comprising at least one or more second tags, one or more second log sources and one or more second selections; for each rule in the second plurality of sigma rules, determining, by the computing device, a second plurality of candidate sets by sampling the one or more second tags, the one or more second log sources, and the one or more second selections to create all possible combinations to include in the second plurality of candidate sets from the respective rule; determining, by the computing device, one or more filters for each sigma rule in the second plurality of sigma rules by using all of the second plurality of candidate sets determined for all of the second plurality of sigma rules as input to the trained machine learning algorithm; and setting, by the computing device, the second plurality of sigma rules with the one or more determined filters in the rule as the sigma rules for the malicious content detection system.
  2. The computerized method of claim 1 further comprising: determining, by the computing device, a similarity score between each unique pair of candidate sets in the second plurality of candidate sets; and for each unique pair having a similarity score greater than a predetermined threshold, excluding one of the candidate sets in the unique pair.
  3. The computerized method of claim 1 wherein each candidate set in the first plurality of candidate sets has an order that causes the one or more filters present in any candidate set to be placed last.
  4. The computerized method of claim 1 wherein the one or more first tags, the one or more first log sources, the one or more first selections, and the one or more first filters in the first plurality of candidate sets are in any order.
  5. The computerized method of claim 1 further comprising: preprocessing, by the computing device, the first plurality of sigma rules into a normalized form.
  6. The computerized method of claim 1 wherein the machine learning algorithm is a TRIE algorithm or recurrent neural network.
  7. The computerized method of claim 1 wherein the second plurality of candidate sets is input to training the machine learning algorithm.
  8. The computerized method of claim 1 wherein one or more rules in the first plurality of sigma rules further comprises a condition, and wherein creating all possible combinations to include in the first plurality of candidate sets from the respective one or more rules further comprises, creating a candidate set for each operator in a Boolean conditional statement in the respective condition, applying some filters according to the condition, or any combination thereof.
  9. One or more non-transitory computer-readable storage media comprising instructions that are executable to cause one or more processors to: receive a first plurality of sigma rules, each sigma rule in the first plurality of sigma rules comprising at least one or more first tags, one or more first log sources and one or more first selections, and wherein at least some of the sigma rules in the first plurality of sigma rules comprise one or more first filters; for each rule in the first plurality of sigma rules, determine a first plurality of candidate sets by sampling the one or more first tags, the one or more first log sources, the one or more first selections, and the one or more first filters when present in the respective rule to create all possible combinations to include in the first plurality of candidate sets from the respective rule; train a machine learning algorithm based on all of the first plurality of candidate sets determined for all of the first plurality of sigma rules; receive a second plurality of sigma rules, each sigma rule in the second plurality of sigma rules comprising at least one or more second tags, one or more second log sources and one or more second selections; for each rule in the second plurality of sigma rules, determine a second plurality of candidate sets by sampling the one or more second tags, the one or more second log sources, and the one or more second selections to create all possible combinations to include in the second plurality of candidate sets from the respective rule; determine one or more filters for each sigma rule in the second plurality of sigma rules by using all of the second plurality of candidate sets determined for all of the second plurality of sigma rules as input to the trained machine learning algorithm; and set the second plurality of sigma rules with the one or more determined filters in the rule as chosen sigma rules for a malicious content detection system.
  10. The one or more non-transitory computer-readable storage media of claim 9 wherein the one or more processors are further caused to: determine a similarity score between each unique pair of candidate sets in the second plurality of candidate sets; and for each unique pair having a similarity score greater than a predetermined threshold, exclude one of the candidate sets in the unique pair.
  11. The one or more non-transitory computer-readable storage media of claim 9 wherein each candidate set in the first plurality of candidate sets has an order that causes the one or more filters present in any candidate set to be placed last.
  12. The one or more non-transitory computer-readable storage media of claim 9 wherein the one or more first tags, the one or more first log sources, the one or more first selections, and the one or more first filters in the first plurality of candidate sets are in any order.
  13. The one or more non-transitory computer-readable storage media of claim 9 wherein the one or more processors are further caused to: preprocess the first plurality of sigma rules into a normalized form.
  14. The one or more non-transitory computer-readable storage media of claim 9 wherein the machine learning algorithm is a TRIE algorithm or recurrent neural network.
  15. The one or more non-transitory computer-readable storage media of claim 9 wherein the second plurality of candidate sets is input to training the machine learning algorithm.
  16. The one or more non-transitory computer-readable storage media of claim 9 wherein one or more rules in the first plurality of sigma rules further comprises a condition, and wherein creating all possible combinations to include in the first plurality of candidate sets from the respective one or more rules further comprises, creating a candidate set for each operator in a Boolean conditional statement in the respective condition, applying some filters according to the condition, or any combination thereof.
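Claims 2 and 10 describe excluding one member of each unique pair of overly similar candidate sets. The following is a minimal sketch of that pruning step, assuming Jaccard similarity and a 0.8 threshold; the claims name neither the metric nor the threshold, so both are illustrative choices.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two candidate sets as |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def dedupe(candidate_sets, threshold=0.8):
    """Exclude one candidate set from each unique pair whose similarity
    exceeds the threshold, as in claims 2 and 10 (metric assumed)."""
    excluded = set()
    for i, j in combinations(range(len(candidate_sets)), 2):
        if i in excluded or j in excluded:
            continue
        if jaccard(candidate_sets[i], candidate_sets[j]) > threshold:
            excluded.add(j)  # keep the first of the pair, drop the second
    return [s for k, s in enumerate(candidate_sets) if k not in excluded]
```

Keeping the first member of each too-similar pair is one arbitrary tie-breaking choice; the claims only require that one of the two be excluded.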

Description

FIELD OF THE INVENTION

The invention relates generally to systems and methods for tracking malicious activity in computing systems that receive messages, using sigma rules. In particular, it relates to systems and methods that automatically determine a filter for sigma rules for a malicious content detection system.

BACKGROUND

Currently, sigma rules can be used to detect malicious content in all communication coming into any system that employs sigma rules. Sigma rules are open source and written in a YAML format. In current systems, sigma rules are written by a system's administrator and/or adapted from open-source databases of available sigma rules. When malicious content is identified by the sigma rules in incoming communication, it can be logged by the system, in some instances producing very large logs (e.g., on the order of terabytes). In some scenarios, the sigma rules can erroneously identify communication as malicious. Typically, the logs are manually reviewed (e.g., by a system administrator) to determine whether any of the communication that was identified as malicious was erroneously identified. For any communication that was erroneously identified as malicious, the sigma rules can be manually updated to include a filter. Manual review of logs for malicious content and/or sigma rule updating is typically done weekly, bi-weekly, or monthly. Difficulties can include human error and the large amount of time, cost, and resources spent. Therefore, it can be desirable to identify communication falsely identified as malicious in logs and/or to update sigma rules automatically.

SUMMARY OF THE INVENTION

Advantages of the invention can include reducing the amount of time, cost, and/or resources required to identify communication falsely identified as malicious in logs. Advantages of the invention can also include updating sigma rule filters automatically. Advantages of the invention can also include reducing the number of duplicate sigma rules.
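The background above describes sigma rules whose filters suppress communication falsely flagged as malicious. As a minimal sketch, the hypothetical rule below follows the open-source Sigma convention of tags, a log source, and a detection with a selection, a filter, and a condition; the matcher is a deliberately naive illustration, not the detection system of this invention.

```python
# Hypothetical sigma-style rule: the filter excludes a known-benign
# account that would otherwise be flagged (a false positive).
rule = {
    "title": "Suspicious download cradle",
    "tags": ["attack.execution"],
    "logsource": {"product": "windows", "service": "powershell"},
    "detection": {
        "selection": {"CommandLine|contains": "DownloadString"},
        "filter": {"User": "backup_svc"},
        "condition": "selection and not filter",
    },
}


def matches(fields, event):
    """True if every field constraint in `fields` matches the event."""
    for key, value in fields.items():
        name, _, op = key.partition("|")
        actual = str(event.get(name, ""))
        if op == "contains":
            if value not in actual:
                return False
        elif actual != value:
            return False
    return True


def evaluate(rule, event):
    """Apply 'selection and not filter': a hit is suppressed when the
    filter matches, i.e., the event comes from a known-benign source."""
    det = rule["detection"]
    hit = matches(det["selection"], event)
    if "filter" in det:
        hit = hit and not matches(det["filter"], event)
    return hit
```

Here a filter is written by hand into the rule; the contribution described in this document is determining such filters automatically.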
Advantages of the invention can also include increased accuracy due to, for example, an ability to process thousands of sigma rules at a time to update filters, which can also be done in real time or near real time, providing the most accurate, up-to-date filters.

In one aspect, the invention involves a computerized method for automated filter determination for sigma rules used in a malicious content detection system. The method involves receiving, by a computing device, a first plurality of sigma rules, each sigma rule in the first plurality of sigma rules comprising at least one or more tags, one or more log sources and one or more selections, and wherein at least some of the sigma rules in the first plurality of sigma rules comprise one or more filters. The method involves, for each rule in the first plurality of sigma rules, determining, by the computing device, a first plurality of candidate sets by sampling the one or more tags, the one or more log sources, the one or more selections, and the one or more filters when present in the respective rule to create all possible combinations to include in the first plurality of candidate sets from the respective rule. The method involves training, by the computing device, a machine learning algorithm based on all of the first plurality of candidate sets determined for all of the first plurality of sigma rules. The method involves receiving, by the computing device, a second plurality of sigma rules, each sigma rule in the second plurality of sigma rules comprising at least one or more tags, one or more log sources and one or more selections. The method involves, for each rule in the second plurality of sigma rules, determining, by the computing device, a second plurality of candidate sets by sampling the one or more tags, the one or more log sources, and the one or more selections to create all possible combinations to include in the second plurality of candidate sets from the respective rule.
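The candidate-set determination described above, sampling a rule's components to create all possible combinations, can be sketched as enumerating every non-empty combination. The component labels are hypothetical; note that `itertools.combinations` preserves input order, so listing filters last keeps them last in every candidate that contains them, the ordering noted elsewhere in this document.

```python
from itertools import combinations


def candidate_sets(components):
    """All non-empty ordered combinations of a single rule's components
    (tags, log sources, selections, and filters when present)."""
    return [
        combo
        for r in range(1, len(components) + 1)
        for combo in combinations(components, r)
    ]


# Hypothetical components of one rule, with its filter listed last.
parts = ["tag:attack.t1059", "logsource:windows", "selection:sel1", "filter:f1"]
cands = candidate_sets(parts)  # 2**4 - 1 = 15 non-empty combinations
```

A rule with n components yields 2**n - 1 candidate sets, which is why pruning near-duplicate sets (as described below for the second plurality) can matter in practice.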
The method involves determining, by the computing device, one or more filters for each sigma rule in the second plurality of sigma rules by using all of the second plurality of candidate sets determined for all of the second plurality of sigma rules as input to the trained machine learning algorithm. The method involves setting, by the computing device, the second plurality of sigma rules with the one or more determined filters in the rule as the sigma rules for the malicious content detection system.

In some embodiments, the method involves determining, by the computing device, a similarity score between each unique pair of candidate sets in the second plurality of candidate sets and, for each unique pair having a similarity score greater than a predetermined threshold, excluding one of the candidate sets in the unique pair. In some embodiments, each candidate set in the first plurality of candidate sets has an order that causes the one or more filters present in any candidate set to be placed last. In some embodiments, the one or more tags, the one or more log sources, the