US-12619429-B2 - Large language model-based software reverse engineering assistant
Abstract
Systems and methods of utilizing a large language model (LLM) to reverse engineer software are provided. The method includes obtaining sample assembly language from coded information or data. The sample assembly language is input to a machine learning (ML) model trained to recognize when the sample assembly language includes malicious code. The method further includes identifying, from the sample assembly language, a functionality implemented by the sample assembly language, where the functionality is indicative of whether the sample assembly language includes the malicious code. The method further includes generating, by a processing device, a natural language indication of the functionality implemented by the sample assembly language. The natural language indication is an output of the ML model.
Inventors
- Felix Schwyzer
- Aditya Kapoor
- Calin-Bogdan Miron
- Marian Radu
Assignees
- CROWDSTRIKE, INC.
Dates
- Publication Date
- 20260505
- Application Date
- 20231226
Claims (20)
- 1. A method comprising: executing a first protocol for autonomous selection and submission of sample assembly language as an input to a machine learning (ML) model; obtaining, based on the first protocol, the sample assembly language from coded information or data, the ML model being trained to recognize when the sample assembly language includes malicious code; identifying, from the sample assembly language, a functionality implemented by the sample assembly language, the functionality being indicative of whether the sample assembly language includes the malicious code; and generating, by a processing device, a natural language indication of the functionality implemented by the sample assembly language, the natural language indication being an output of the ML model.
- 2. The method of claim 1, wherein the identifying the functionality implemented by the sample assembly language comprises: de-obfuscating the sample assembly language based on a de-obfuscation script generated with a higher-level language than the sample assembly language.
- 3. The method of claim 1, further comprising: retraining the ML model on an assembly language dataset based on updates to the assembly language dataset, the assembly language dataset including benign samples, malicious samples, obfuscated samples, and non-obfuscated samples.
- 4. The method of claim 1, wherein the identifying the functionality implemented by the sample assembly language comprises: applying a reverse engineering procedure to the sample assembly language to generate de-obfuscated sample assembly language.
- 5. The method of claim 4, wherein the applying the reverse engineering procedure comprises at least one of: executing a second protocol for autonomous selection of the reverse engineering procedure from a plurality of reverse engineering procedures, or receiving an explicit indication of the reverse engineering procedure.
- 6. The method of claim 4, further comprising: initiating an implementation of the de-obfuscated sample assembly language, the implementation comprising analyzing whether the sample assembly language includes the malicious code.
- 7. The method of claim 1, wherein the natural language indication output by the ML model further indicates a probability of the sample assembly language including the malicious code.
- 8. The method of claim 1, wherein the sample assembly language corresponds to a portable executable (PE) file.
- 9. A system comprising: a processing device; and a memory to store instructions that, when executed by the processing device, cause the processing device to: execute a first protocol for autonomous selection and submission of sample assembly language as an input to a machine learning (ML) model; obtain, based on the first protocol, the sample assembly language from coded information or data, the ML model being trained to recognize when the sample assembly language includes malicious code; identify, from the sample assembly language, a functionality implemented by the sample assembly language, the functionality being indicative of whether the sample assembly language includes the malicious code; and generate a natural language indication of the functionality implemented by the sample assembly language, the natural language indication being an output of the ML model.
- 10. The system of claim 9, wherein to identify the functionality implemented by the sample assembly language the processing device is further to: de-obfuscate the sample assembly language based on a de-obfuscation script generated with a higher-level language than the sample assembly language.
- 11. The system of claim 9, wherein the processing device is further to: retrain the ML model on an assembly language dataset based on updates to the assembly language dataset, the assembly language dataset including benign samples, malicious samples, obfuscated samples, and non-obfuscated samples.
- 12. The system of claim 9, wherein to identify the functionality implemented by the sample assembly language the processing device is further to: apply a reverse engineering procedure to the sample assembly language to generate de-obfuscated sample assembly language.
- 13. The system of claim 12, wherein to apply the reverse engineering procedure the processing device is to at least one of: execute a second protocol for autonomous selection of the reverse engineering procedure from a plurality of reverse engineering procedures, or receive an explicit indication of the reverse engineering procedure.
- 14. The system of claim 12, wherein the processing device is further to: initiate an implementation of the de-obfuscated sample assembly language, the implementation comprising analyzing whether the sample assembly language includes the malicious code.
- 15. The system of claim 9, wherein the natural language indication output by the ML model further indicates a probability of the sample assembly language including the malicious code.
- 16. The system of claim 9, wherein the sample assembly language corresponds to a portable executable (PE) file.
- 17. A non-transitory computer-readable storage medium including instructions that, when executed by a processing device, cause the processing device to: execute a first protocol for autonomous selection and submission of sample assembly language as an input to a machine learning (ML) model; obtain, based on the first protocol, the sample assembly language from coded information or data, the ML model being trained to recognize when the sample assembly language includes malicious code; identify, from the sample assembly language, a functionality implemented by the sample assembly language, the functionality being indicative of whether the sample assembly language includes the malicious code; and generate, by the processing device, a natural language indication of the functionality implemented by the sample assembly language, the natural language indication being an output of the ML model.
- 18. The non-transitory computer-readable storage medium of claim 17, wherein to identify the functionality implemented by the sample assembly language the processing device is further to: de-obfuscate the sample assembly language based on a de-obfuscation script generated with a higher-level language than the sample assembly language.
- 19. The non-transitory computer-readable storage medium of claim 17, wherein the processing device is further to: retrain the ML model on an assembly language dataset based on updates to the assembly language dataset, the assembly language dataset including benign samples, malicious samples, obfuscated samples, and non-obfuscated samples.
- 20. The non-transitory computer-readable storage medium of claim 19, wherein the processing device is further to: execute a second protocol for autonomous selection of a reverse engineering procedure from a plurality of reverse engineering procedures, or receive an explicit indication of the reverse engineering procedure.
Description
TECHNICAL FIELD

Aspects of the present disclosure relate to machine learning (ML) models, and more particularly, to large language models (LLMs) used for reverse engineering.

BACKGROUND

Large language models are designed to understand and generate coherent and contextually relevant text. They are typically built using deep learning techniques with a neural network architecture and are trained on substantial amounts of text data to learn to generate responses. The training process for large language models involves exposing the model to vast quantities of text from various sources, such as books, articles, websites, and other data. Large language models use tokens as the fundamental units into which text is divided for processing. Tokens are usually smaller units of text, such as individual characters, subwords (e.g., byte-pair encoding), or words. Large language models tokenize queries and general text documentation as part of their input processing, which enables them to manage large volumes of text efficiently. By breaking text into tokens and representing it numerically, large language models can understand and generate responses based on the underlying patterns and relationships within the text.

BRIEF DESCRIPTION OF THE DRAWINGS

The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.

FIG. 1 is a block diagram that illustrates an example system for assembly language sample analysis, in accordance with some embodiments of the present disclosure.

FIG. 2 is a flow diagram of a method for generating a natural language indication of functionality implemented by sample assembly language, in accordance with some embodiments of the present disclosure.

FIG. 3 is a component diagram of an example of a device architecture for ML-based software reverse engineering assistance, in accordance with some embodiments of the present disclosure.

FIG. 4 is a block diagram of an example computing device that may perform one or more of the operations described herein, in accordance with some embodiments of the disclosure.

DETAILED DESCRIPTION

Malware analysis techniques can include a static analysis of source code to determine whether the source code includes, or is accompanied by, malicious code. In this domain, static analysis refers to a reverse engineering procedure that allows for inspection and examination of the source code without having to execute it. For example, when analyzing a portable executable (PE), the static analysis may include interpreting an assembly language representation of the associated binary code. In this manner, malware such as file infectors that accompany the source code can be detected without compromising a host device or associated information through execution of an infected executable. Oftentimes, a significant amount of time is expended attempting to detect malicious functions at the assembly language representation level of the source code.

Assembly language is a low-level programming language that simplifies the binary instructions that are input to a processing device, such as a central processing unit (CPU). More specifically, assembly language is a human-readable abstraction (e.g., text characters) mapped to machine (e.g., binary) code, so that programmers do not have to manually count 1s and 0s to understand the code. However, even the assembly language may not be readily understood by some operators.
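As a toy illustration of the assembly-to-natural-language mapping just described (the mnemonics and glosses below are hypothetical examples, not the trained ML model of the disclosure, which generalizes far beyond a lookup table):

```python
# Hypothetical natural-language glosses for a few common x86 mnemonics.
GLOSSES = {
    "xor": "set a register by exclusive-OR (often zeroing or decoding data)",
    "mov": "copy a value between registers or memory",
    "call": "invoke a subroutine",
    "jmp": "transfer control unconditionally",
}

def describe(instruction: str) -> str:
    """Return a natural-language indication for one assembly instruction."""
    mnemonic = instruction.split()[0].lower()
    gloss = GLOSSES.get(mnemonic, "unrecognized instruction")
    return f"{instruction!r}: {gloss}"

print(describe("xor eax, eax"))
# 'xor eax, eax': set a register by exclusive-OR (often zeroing or decoding data)
```

A static table like this captures only single instructions; the point of the disclosure is that an LLM can instead summarize whole sequences of instructions as a functionality-level description.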
Thus, if a malicious function is detected in the assembly language, such functionality may have to be indicated as a natural language expression in order for the functionality to be understood by the operator. In some cases, converting the assembly language to natural language expressions can be a manual procedure that is both time-consuming and tedious.

Malware may be obfuscated by a malicious actor to make the malware more difficult to detect. When statically analyzing assembly language representations of malware samples, the analyzer may have to de-obfuscate the code or data. De-obfuscation routines identified in the sample may be reimplemented to de-obfuscate the code or data; for example, a higher-level language, such as Python, may be used to reimplement those routines. Again, however, reimplementing de-obfuscation routines can involve a manual component that is both time-consuming and repetitive.

Examples of the present disclosure address the above-noted and other deficiencies by providing an ML model, such as an LLM, that is trained to recognize when sample assembly language includes malicious code. The ML model automates the process of detecting relevant functions mapped to portions of the
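The Python reimplementation of a recovered de-obfuscation routine mentioned above can be sketched minimally as follows. The single-byte XOR scheme, the key, and the payload are hypothetical stand-ins for values an analyst would identify in the sample's assembly during static analysis:

```python
def xor_deobfuscate(blob: bytes, key: int) -> bytes:
    """Reimplementation of a (hypothetical) single-byte XOR routine
    recovered from a sample's assembly during static analysis."""
    return bytes(b ^ key for b in blob)

# Hypothetical obfuscated payload: an API name XOR-encoded with key 0x5A.
obfuscated = bytes(b ^ 0x5A for b in b"GetProcAddress")

# Applying the recovered routine exposes the hidden string.
print(xor_deobfuscate(obfuscated, 0x5A).decode())  # GetProcAddress
```

Because XOR with a fixed key is its own inverse, the same routine both obfuscates and de-obfuscates; real samples typically use more elaborate schemes, which is what makes manual reimplementation repetitive.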