US-12619725-B1 - Systems and methods for malicious command line and script detection through deployment of generative artificial intelligence
Abstract
Implementations of the disclosure are directed to configuring a pre-trained large language model (LLM) to be used for zero-day attack detection of log data, scripts, commands, operators, etc. The pre-trained LLM may be configured to generate probabilistic labels for data to be analyzed as being part of a cyberthreat or cyberattack. In some instances, generative artificial intelligence (GenAI) technologies may be utilized by or with the pre-trained LLM to generate the probabilistic labels. The probabilistic labels, along with features extracted from the data, may be provided to a machine learning model, which may also receive behavioral analysis results from a user behavioral analytics system (e.g., baseline-based behavioral models) and generate a detection report. Feedback may be utilized in retraining the pre-trained LLM. Additionally, GenAI techniques may be utilized to generate a natural language summary of the detection report.
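The abstract describes fusing an LLM-produced probabilistic label with features extracted from the data and a behavioral-analytics score before a downstream model renders a determination. The sketch below illustrates one way such a fusion step could look; all names (`Event`, `extract_features`, `detect`), the toy features, and the linear scoring are illustrative assumptions, not the patent's stated implementation.

```python
# Hypothetical sketch: fusing an LLM probabilistic label, extracted
# features, and a behavioral-analytics score into one detection score.
from dataclasses import dataclass


@dataclass
class Event:
    script_text: str
    llm_suspicion: float      # probabilistic label from the pre-trained LLM (0..1)
    behavior_anomaly: float   # score from the user behavioral analytics system (0..1)


def extract_features(script: str) -> list:
    # Toy stand-ins for real static features of a command line or script.
    return [
        len(script) / 1000.0,                  # normalized length
        float(script.lower().count("-enc")),   # encoded-command flags
        script.count("|") / 10.0,              # pipeline depth
    ]


def detect(event: Event, weights: list, threshold: float = 0.5) -> dict:
    """Linear fusion of static features, LLM label, and behavioral score."""
    x = extract_features(event.script_text) + [event.llm_suspicion,
                                               event.behavior_anomaly]
    score = sum(w * v for w, v in zip(weights, x))
    return {"score": score, "suspicious": score >= threshold}


report = detect(
    Event("powershell -enc SQBFAFgA | iex",
          llm_suspicion=0.92, behavior_anomaly=0.7),
    weights=[0.1, 0.5, 0.2, 0.6, 0.4],
)
```

In practice the fusion weights would be learned by the downstream machine learning model rather than fixed by hand as here.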
Inventors
- Cui Lin
- Rodolfo J. Soto
Assignees
- SPLUNK INC.
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2024-09-16
Claims (20)
- 1 . A computer-implemented method, comprising: obtaining historical log data including a plurality of logs being recordings of activities or occurrences during operation of a network device, wherein each recording includes or involves a script or an executable file; obtaining synthetic log data including a plurality of synthetic logs being synthetic data representative of activities or occurrences during operation of the network device, wherein the synthetic data include or involve synthetic scripts or synthetic executable files that include a label indicating whether each synthetic script or synthetic executable file is suspicious or benign; deploying a first generative machine learning model by providing the historical log data and the synthetic log data as input, wherein the first generative machine learning model is trained and configured to generate training probabilistic labels indicating a first level of suspiciousness for each script or executable file of the historical log data; performing a re-training process or a fine-tuning process on a large language model (LLM) including: processing a batch of the plurality of logs of the historical log data to generate second probabilistic labels indicating a second level of suspiciousness for each script or executable file of the batch of the plurality of logs, determining a loss between the second probabilistic labels and corresponding labels of the training probabilistic labels, and adjusting weights or parameters of the LLM according to the loss; and storing the LLM following implementation of the re-training or the fine-tuning process.
- 2 . The computer-implemented method of claim 1 , wherein the script or the executable of each recording of the historical log data is a PowerShell script.
- 3 . The computer-implemented method of claim 1 , wherein the recordings of the historical log data are Windows Events.
- 4 . The computer-implemented method of claim 1 , further comprising: deploying the LLM by providing additional log data including a second plurality of logs being additional recordings of activities or occurrences during subsequent operation of the network device, wherein each recording includes or involves an additional script or an additional executable file.
- 5 . The computer-implemented method of claim 1 , further comprising: prior to deploying the first generative machine learning model, applying a set of security rules or a set of machine learning models to the recordings of activities or occurrences during operation of the network device resulting in a set of suspiciousness determinations, wherein the set of suspiciousness determinations are provided as part of the input.
- 6 . The computer-implemented method of claim 1 , further comprising: prior to deploying the first generative machine learning model, performing a data balancing procedure such that the first generative machine learning model is provided a dataset that is more balanced between benign and suspicious examples than the historical log data.
- 7 . The computer-implemented method of claim 1 , further comprising: performing an additional re-training procedure on the first generative machine learning model based on feedback received as user input.
- 8 . A computing device, comprising: a processor; and a non-transitory computer-readable medium having stored thereon instructions that, when executed by the processor, cause the processor to perform operations including: obtaining historical log data including a plurality of logs being recordings of activities or occurrences during operation of a network device, wherein each recording includes or involves a script or an executable file; obtaining synthetic log data including a plurality of synthetic logs being synthetic data representative of activities or occurrences during operation of the network device, wherein the synthetic data include or involve synthetic scripts or synthetic executable files that include a label indicating whether each synthetic script or synthetic executable file is suspicious or benign; deploying a first generative machine learning model by providing the historical log data and the synthetic log data as input, wherein the first generative machine learning model is trained and configured to generate training probabilistic labels indicating a first level of suspiciousness for each script or executable file of the historical log data; performing a re-training process or a fine-tuning process on a large language model (LLM) including: processing a batch of the plurality of logs of the historical log data to generate second probabilistic labels indicating a second level of suspiciousness for each script or executable file of the batch of the plurality of logs, determining a loss between the second probabilistic labels and corresponding labels of the training probabilistic labels, and adjusting weights or parameters of the LLM according to the loss; and storing the LLM following implementation of the re-training or the fine-tuning process.
- 9 . The computing device of claim 8 , wherein the script or the executable of each recording of the historical log data is a PowerShell script.
- 10 . The computing device of claim 8 , wherein the recordings of the historical log data are Windows Events.
- 11 . The computing device of claim 8 , wherein the operations further include: deploying the LLM by providing additional log data including a second plurality of logs being additional recordings of activities or occurrences during subsequent operation of the network device, wherein each recording includes or involves an additional script or an additional executable file.
- 12 . The computing device of claim 8 , wherein the operations further include: prior to deploying the first generative machine learning model, applying a set of security rules or a set of machine learning models to the recordings of activities or occurrences during operation of the network device resulting in a set of suspiciousness determinations, wherein the set of suspiciousness determinations are provided as part of the input.
- 13 . The computing device of claim 8 , wherein the operations further include: prior to deploying the first generative machine learning model, performing a data balancing procedure such that the first generative machine learning model is provided a dataset that is more balanced between benign and suspicious examples than the historical log data.
- 14 . The computing device of claim 8 , wherein the operations further include: performing an additional re-training procedure on the first generative machine learning model based on feedback received as user input.
- 15 . A non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations including: obtaining historical log data including a plurality of logs being recordings of activities or occurrences during operation of a network device, wherein each recording includes or involves a script or an executable file; obtaining synthetic log data including a plurality of synthetic logs being synthetic data representative of activities or occurrences during operation of the network device, wherein the synthetic data include or involve synthetic scripts or synthetic executable files that include a label indicating whether each synthetic script or synthetic executable file is suspicious or benign; deploying a first generative machine learning model by providing the historical log data and the synthetic log data as input, wherein the first generative machine learning model is trained and configured to generate training probabilistic labels indicating a first level of suspiciousness for each script or executable file of the historical log data; performing a re-training process or a fine-tuning process on a large language model (LLM) including: processing a batch of the plurality of logs of the historical log data to generate second probabilistic labels indicating a second level of suspiciousness for each script or executable file of the batch of the plurality of logs, determining a loss between the second probabilistic labels and corresponding labels of the training probabilistic labels, and adjusting weights or parameters of the LLM according to the loss; and storing the LLM following implementation of the re-training or the fine-tuning process.
- 16 . The non-transitory computer-readable medium of claim 15 , wherein the script or the executable of each recording of the historical log data is a PowerShell script, and wherein the recordings of the historical log data are Windows Events.
- 17 . The non-transitory computer-readable medium of claim 15 , wherein the operations further include: deploying the LLM by providing additional log data including a second plurality of logs being additional recordings of activities or occurrences during subsequent operation of the network device, wherein each recording includes or involves an additional script or an additional executable file.
- 18 . The non-transitory computer-readable medium of claim 15 , wherein the operations further include: prior to deploying the first generative machine learning model, applying a set of security rules or a set of machine learning models to the recordings of activities or occurrences during operation of the network device resulting in a set of suspiciousness determinations, wherein the set of suspiciousness determinations are provided as part of the input.
- 19 . The non-transitory computer-readable medium of claim 15 , wherein the operations further include: prior to deploying the first generative machine learning model, performing a data balancing procedure such that the first generative machine learning model is provided a dataset that is more balanced between benign and suspicious examples than the historical log data.
- 20 . The non-transitory computer-readable medium of claim 15 , wherein the operations further include: performing an additional re-training procedure on the first generative machine learning model based on feedback received as user input.
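Claim 1 describes a teacher–student style fine-tuning loop: a first generative model supplies training probabilistic labels, the LLM produces its own second probabilistic labels on batches of historical logs, and the LLM's weights are adjusted according to a loss between the two. The sketch below stands in a tiny logistic model for the LLM and uses binary cross-entropy as the loss; the model, the scalar feature, the learning rate, and all function names are illustrative assumptions, not the patent's stated implementation.

```python
# Minimal, hypothetical sketch of the claim-1 fine-tuning loop:
# minimize a loss between the student's probabilistic labels and
# the teacher's training probabilistic labels, adjusting weights.
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(p: float, y: float) -> float:
    """Binary cross-entropy between predicted p and (soft) target y."""
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def fine_tune(xs, teacher_labels, w, b, lr=0.5, epochs=200):
    """Adjust weights/bias so the student's labels match the teacher's."""
    for _ in range(epochs):
        for x, y in zip(xs, teacher_labels):
            p = sigmoid(w * x + b)   # second probabilistic label
            grad = p - y             # d(BCE)/d(logit) for a sigmoid output
            w -= lr * grad * x       # adjust weights according to the loss
            b -= lr * grad
    return w, b


# Toy "historical logs": one scalar suspiciousness feature per script,
# with the first generative model's (teacher's) probabilistic labels.
xs = [0.1, 0.2, 0.9, 0.8]
teacher = [0.05, 0.1, 0.95, 0.9]
w, b = fine_tune(xs, teacher, w=0.0, b=0.0)
loss = sum(bce(sigmoid(w * x + b), y) for x, y in zip(xs, teacher)) / len(xs)
```

A real implementation would backpropagate through the LLM's parameters rather than a two-parameter model, but the loss-then-update structure is the same.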
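Claim 6 recites a data-balancing procedure so the first generative model sees a dataset more balanced between benign and suspicious examples than the raw historical logs, which in practice are dominated by benign activity. One common approach is minority-class oversampling, sketched below; this specific technique and the function names are assumptions for illustration, not the patent's stated method.

```python
# Hypothetical sketch of the claim-6 balancing step: oversample the
# minority class (typically "suspicious") until class sizes match.
import random


def balance(dataset, seed=0):
    """Return a dataset with equal benign and suspicious counts."""
    rng = random.Random(seed)
    benign = [r for r in dataset if r["label"] == "benign"]
    suspicious = [r for r in dataset if r["label"] == "suspicious"]
    minority, majority = sorted([benign, suspicious], key=len)
    # Draw with replacement from the minority class until sizes match.
    padded = minority + rng.choices(minority, k=len(majority) - len(minority))
    return majority + padded


logs = ([{"script": f"s{i}", "label": "benign"} for i in range(8)]
        + [{"script": "m0", "label": "suspicious"}])
balanced = balance(logs)
```

Alternatives such as undersampling the majority class, or generating additional synthetic suspicious examples (which the claims separately contemplate as synthetic log data), would serve the same purpose.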
Description
RELATED APPLICATIONS
Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are incorporated by reference under 37 CFR 1.57 and made a part of this specification.
BACKGROUND
PowerShell is a powerful scripting language and shell framework primarily used on network devices running a version of the WINDOWS® operating system. The PowerShell scripting language has been around for many years, is used by many system administrators, and is on track to replace the default command prompt within the WINDOWS® operating systems in the future. According to a research study by Symantec Corporation, 95.4% of all scripts analyzed by its Blue Coat Sandbox were malicious. A recent study also showed that PowerShell scripts have become the attack technique most often used by threat actors. For example, a study by Red Canary indicated that approximately 22% of its customers were affected in 2023 by a cyberattack involving PowerShell. Malicious PowerShell scripts may be predominantly used as downloaders, e.g., macros used with MICROSOFT OFFICE®, during the incursion phase of a cyberattack. Another common use occurs during the lateral movement phase of a cyberattack, allowing malicious code execution on a remote host when spreading inside the network. PowerShell scripts may also download and execute commands directly from memory, making it hard for forensics experts to trace the infection. Threat actors may generally use PowerShell to execute commands, evade detection, obfuscate malicious activity, spawn additional processes, remotely download and execute arbitrary code and binaries, gather information, and/or change system configurations. In some instances, PowerShell has been used by threat actors to disable Windows security tools.
BRIEF DESCRIPTION OF THE DRAWINGS Illustrative examples are described in detail below with reference to the following figures: FIG. 1 is a block diagram illustrating a diagrammatic flow of the processing of data resulting in the fine-tuning of a large language model (LLM) configured to generate probabilistic labels as to the suspiciousness of PowerShell scripts according to an implementation of the disclosure; FIG. 2 is a flowchart illustrating example operations for performing a fine-tuning process of a large language model (LLM) configured to generate probabilistic labels as to the suspiciousness of PowerShell scripts according to an implementation of the disclosure; FIG. 3 is a block diagram illustrating a detailed diagrammatic flow of the generation of a dataset to be input into a label generation generative model according to an implementation of the disclosure; FIG. 4 is a flowchart illustrating example operations for generating probabilistic labels by a generative model from customer log data according to an implementation of the disclosure; FIG. 5 is a block diagram illustrating a diagrammatic flow of deployment of a plurality of large language models (LLMs) and an anomaly detection system to determine a suspiciousness prediction of and a threat detection report directed to a PowerShell script according to an implementation of the disclosure; FIG. 6 is a flowchart illustrating example operations for generating probabilistic labels by a generative model from customer log data according to an implementation of the disclosure; FIG.
7 is a block diagram illustrating a diagrammatic flow of deployment of a large language model (LLM), a user behavioral analytics system, and a PowerShell script threat detection model to determine a suspiciousness prediction of and a threat detection report directed to a PowerShell script and further a generative model configured to generate an interpretation of the threat detection report according to an implementation of the disclosure; FIG. 8 is a flowchart illustrating example operations for generating a threat detection report according to an implementation of the disclosure; FIG. 9 is a block diagram illustrating a deployment configuration of a networked environment including a plurality of models processing in a deep learning platform and other network components according to an implementation of the disclosure; FIG. 10 is a flowchart illustrating example operations for performing automated label generation and training operations of a large language model according to an implementation of the disclosure; FIG. 11 is a flowchart illustrating example operations for deploying a machine learning model configured to generate a malicious determination of a script or executable file according to an implementation of the disclosure; FIG. 12 is a block diagram illustrating an example computing environment that includes a data intake and query system according to an implementation of the disclosure; FIG. 13 is a block diagram illustrating in greater detail an example of an indexing system of a data intake and query system, such