US-12619721-B2 - Methods and apparatus for malware classification through convolutional neural networks using raw bytes
Abstract
Methods, apparatus, systems, and articles of manufacture are disclosed. An example apparatus includes at least one memory, instructions, and processor circuitry to execute the instructions to train a neural network with a plurality of raw byte data samples, perform feature extraction on ones of the plurality of raw byte data samples, determine whether ones of the plurality of raw byte data samples are clean or malicious using the extracted features, and determine a family of malware to which an identified malicious sample belongs.
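The abstract describes a two-stage flow: a first network extracts static features from raw bytes and decides clean versus malicious, and a second network assigns a malware family. The NumPy sketch below illustrates that flow in toy form only; the kernel sizes, weights, pooling scheme, and family names are illustrative assumptions, not the patented architecture.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772   # standard SELU constants; the patent names
SELU_LAMBDA = 1.0507009873554805  # the activation but does not fix these values

def relu(x):
    # Rectified linear unit applied after the convolutional feature layers.
    return np.maximum(x, 0.0)

def selu(x):
    # Scaled exponential linear unit used in the family-classifier head.
    return SELU_LAMBDA * np.where(x > 0.0, x, SELU_ALPHA * np.expm1(x))

def extract_features(sample: bytes, kernels) -> np.ndarray:
    # Toy stage-1 extractor: 1-D convolutions over the raw byte stream
    # (no preprocessing beyond scaling), ReLU, then mean pooling to a
    # fixed-length feature vector.
    x = np.frombuffer(sample, dtype=np.uint8).astype(np.float32) / 255.0
    return np.array(
        [relu(np.convolve(x, k, mode="valid")).mean() for k in kernels],
        dtype=np.float32,
    )

def classify_binary(features, w, b) -> str:
    # Toy clean-vs-malicious head: logistic score over the extracted features.
    score = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return "malicious" if score >= 0.5 else "clean"

def classify_family(features, W, families) -> str:
    # Toy stage-2 head: one SELU layer over the static features, then argmax.
    return families[int(np.argmax(selu(W @ features)))]

# Hypothetical usage on a random 256-byte sample with 4 random 8-tap filters.
rng = np.random.default_rng(0)
kernels = [rng.standard_normal(8) for _ in range(4)]
sample = bytes(rng.integers(0, 256, size=256, dtype=np.uint8))
feats = extract_features(sample, kernels)
verdict = classify_binary(feats, rng.standard_normal(4), 0.0)
family = classify_family(feats, rng.standard_normal((3, 4)),
                         ["family_a", "family_b", "family_c"])
print(verdict, family)
```

The weights here are random, so the verdict and family labels are arbitrary; in the disclosed system they would come from training the two networks on labeled clean and malicious samples.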
Inventors
- Yonghong Huang
- Steven Grobman
- Jonathan King
- Craig Schmugar
- Abhishek Karnik
- Celeste Fralick
- Vitaly Zaytsev
Assignees
- McAfee, LLC
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2022-04-05
Claims (20)
- 1. An apparatus comprising: at least one memory; instructions; and at least one processor circuit to execute the instructions to: train a first neural network with a plurality of raw byte data samples, wherein the plurality of raw byte data samples include both clean and malicious samples, wherein the plurality of raw byte data samples are not preprocessed; perform feature extraction using one or more raw byte data samples of the plurality of raw byte data samples and a rectified linear unit (RELU) to obtain extracted features, the extracted features including malware static features; determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious using the extracted features; and provide the malware static features as input to a second neural network to determine a family of malware to which an identified malicious sample belongs by using a scaled exponential linear unit (SELU).
- 2. The apparatus of claim 1, wherein a feature-based classifier is used to determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious.
- 3. The apparatus of claim 1, wherein a feature-based classifier is used to determine the family of malware to which the identified malicious sample belongs.
- 4. The apparatus of claim 1, wherein the first neural network is trained using a supervised learning algorithm such as one or more of a Regression, Decision Tree, Random Forest, k-nearest neighbors (KNN), or Logistic Regression algorithm.
- 5. The apparatus of claim 1, wherein the plurality of raw byte data samples is deduplicated prior to use in training of the first neural network.
- 6. The apparatus of claim 1, wherein the RELU is used as a nonlinear activation function for learning a set of first and second convolutional layers of the first neural network.
- 7. The apparatus of claim 1, wherein the SELU is used as a nonlinear activation function for learning a set of fully connected layers of a feature-based classifier.
- 8. A non-transitory computer readable medium comprising a plurality of instructions that, when executed, cause a machine to at least: train a first neural network with a plurality of raw byte data samples, wherein the plurality of raw byte data samples include both clean and malicious samples, and wherein the plurality of raw byte data samples are not preprocessed; perform feature extraction using one or more raw byte data samples of the plurality of raw byte data samples and a rectified linear unit (RELU) to obtain extracted features, the extracted features including malware static features; determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious using the extracted features; and provide the malware static features as input to a second neural network to determine a family of malware to which an identified malicious sample belongs by using a scaled exponential linear unit (SELU).
- 9. The non-transitory computer readable medium of claim 8, wherein a feature-based classifier is used to determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious.
- 10. The non-transitory computer readable medium of claim 8, wherein a feature-based classifier is used to determine the family of malware to which the identified malicious sample belongs.
- 11. The non-transitory computer readable medium of claim 8, wherein the first neural network is trained using a supervised learning algorithm such as one or more of a Regression, Decision Tree, Random Forest, k-nearest neighbors (KNN), or Logistic Regression algorithm.
- 12. The non-transitory computer readable medium of claim 8, wherein the plurality of raw byte data samples is deduplicated prior to use in training of the first neural network.
- 13. The non-transitory computer readable medium of claim 8, wherein the RELU is used as a nonlinear activation function for learning a set of first and second convolutional layers of the first neural network.
- 14. A method to perform malware classification through convolutional neural networks using raw bytes, the method comprising: training a first neural network with a plurality of raw byte data samples, wherein the plurality of raw byte data samples include both clean and malicious samples, and wherein the plurality of raw byte data samples are not preprocessed; performing feature extraction using one or more raw byte data samples of the plurality of raw byte data samples and a rectified linear unit (RELU) to obtain extracted features, the extracted features including malware static features; determining whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious using the extracted features; and providing the malware static features as input to a second neural network to determine a family of malware to which an identified malicious sample belongs by using a scaled exponential linear unit (SELU).
- 15. The method of claim 14, wherein a feature-based classifier is used to determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious.
- 16. The method of claim 14, wherein a feature-based classifier is used to determine the family of malware to which the identified malicious sample belongs.
- 17. The method of claim 14, wherein the first neural network is trained using a supervised learning algorithm such as one or more of a Regression, Decision Tree, Random Forest, k-nearest neighbors (KNN), or Logistic Regression algorithm.
- 18. The method of claim 14, wherein the plurality of raw byte data samples is deduplicated prior to use in training of the first neural network.
- 19. The method of claim 14, wherein the RELU is used as a nonlinear activation function for learning a set of first and second convolutional layers of the first neural network.
- 20. An apparatus to perform malware classification through convolutional neural networks using raw bytes comprising: interface circuitry; and processor circuitry including one or more of: at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations according to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus; a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations; or Application Specific Integrated Circuitry (ASIC) including logic gate circuitry to perform one or more third operations; the processor circuitry to perform at least one of the one or more first operations, the one or more second operations or the one or more third operations to instantiate: neural network training circuitry to train a first neural network with a plurality of raw byte data samples, wherein the plurality of raw byte data samples include both clean and malicious samples, and wherein the plurality of raw byte data samples are not preprocessed; feature extraction circuitry to perform feature extraction using one or more raw byte data samples of the plurality of raw byte data samples and a rectified linear unit (RELU) to obtain extracted features, the extracted features including malware static features; sample classification circuitry to determine whether the one or more raw byte data samples of the plurality of raw byte data samples are clean or malicious using the extracted features; and malware family classification circuitry to determine a family of malware to which an identified malicious sample belongs by using the malware static features as input to a second neural network and a scaled exponential linear unit (SELU).
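The independent claims pair a RELU-activated convolutional feature extractor with a SELU-activated, fully connected family classifier (claims 6, 7, 13, and 19). SELU is the self-normalizing activation whose fixed scaling keeps layer activations near zero mean and unit variance. The sketch below shows both activations with the standard SELU constants; the claims name the activations but do not state these constants, so the values are an assumption drawn from the self-normalizing-networks literature.

```python
import numpy as np

# Standard SELU constants; an assumption, as the claims do not fix them.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def relu(x: np.ndarray) -> np.ndarray:
    # RELU: identity for positive inputs, zero otherwise. Per the claims,
    # used as the nonlinearity of the convolutional layers.
    return np.maximum(x, 0.0)

def selu(x: np.ndarray) -> np.ndarray:
    # SELU: a scaled exponential linear unit. Per the claims, used as the
    # nonlinearity of the fully connected family-classifier layers.
    return LAMBDA * np.where(x > 0.0, x, ALPHA * np.expm1(x))

# Self-normalizing behaviour: on standard-normal inputs, SELU's output
# stays near zero mean and unit variance, whereas ReLU zeroes the entire
# negative half of the distribution.
x = np.random.default_rng(1).standard_normal(100_000)
out = selu(x)
print(out.mean(), out.var())  # both close to the (0, 1) fixed point
```

This stability of the activation statistics is the usual motivation for choosing SELU over ReLU in deep fully connected stacks such as the claimed family classifier.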
Description
RELATED APPLICATION

This patent arises from a continuation of U.S. Patent Application Ser. No. 63/170,647, which was filed on Apr. 5, 2021. U.S. Provisional Patent Application No. 63/170,647 is hereby incorporated herein by reference in its entirety. Priority to U.S. Provisional Patent Application No. 63/170,647 is hereby claimed.

FIELD OF THE DISCLOSURE

This disclosure relates generally to malware classification and, more particularly, to methods and apparatus for malware classification through convolutional neural networks (CNNs) using raw bytes.

BACKGROUND

The introduction of malware into regular software has grown rapidly in recent years. The ability to classify and categorize malware and benign software is an important function of security programs. Polymorphic malware refers to malware that can change its appearance and/or signature files in order to avoid detection. When training a machine learning (ML) model to distinguish between malicious and clean software, the ability to process polymorphic malware is important for ensuring accuracy of sample classification by the machine learning model in deployment.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1C are block diagrams of an example raw byte classification system to classify samples as clean or malicious and/or to extract features. FIG. 2 is an example depiction of a neural network architecture for the example raw byte classification system of FIGS. 1A, 1B, and/or 1C. FIG. 3 is a block diagram of an example implementation of the raw byte classification system of FIGS. 1A-1C. FIG. 4 is a representation of deduplicated family data prevalence for malware family classification. FIGS. 5A and 5B depict example t-Distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) pre-training and post-training plots. FIGS. 6A and 6B depict example learning curves for malware classification. FIG. 7 shows a last-layer t-SNE plot visualizing output of the neural network after training is complete. FIG. 8 depicts deduplicated data prevalence across file types for binary classification. FIG. 9 shows example t-SNE pre-training and post-training plots for binary classification. FIGS. 10A and 10B illustrate example first and second binary learning curves for binary classification. FIG. 11 shows an example Receiver Operating Characteristic (ROC) curve indicating classification performance of the neural network. FIGS. 12A-12C show example first, second, and third ROC curves for different data sample inputs. FIG. 13 is a flowchart representative of example machine readable instructions that may be executed by example processor circuitry to implement the example raw byte classification circuitry of FIG. 3, in accordance with the teachings of this disclosure. FIG. 14 is a block diagram of an example processing platform including processor circuitry structured to execute the example machine readable instructions of FIG. 13 to implement the raw byte classification circuitry 300 of FIG. 3. FIG. 15 is a block diagram of an example implementation of the processor circuitry of FIG. 14. FIG. 16 is a block diagram of another example implementation of the processor circuitry of FIG. 14. FIG. 17 is a block diagram of an example software distribution platform (e.g., one or more servers) to distribute software (e.g., software corresponding to the example machine readable instructions of FIG. 13) to client devices associated with end users and/or consumers (e.g., for license, sale, and/or use), retailers (e.g., for sale, re-sale, license, and/or sub-license), and/or original equipment manufacturers (OEMs) (e.g., for inclusion in products to be distributed to, for example, retailers and/or to other end users such as direct buy customers).

The figures are not to scale. Instead, the thickness of the layers or regions may be enlarged in the drawings. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

As used herein, unless otherwise stated, the term “above” describes the relationship of two parts relative to Earth. A first part is above a second part if the second part has at least one part between Earth and the first part. Likewise, as used herein, a first part is “below” a second part when the first part is closer to the Earth than the second part. As noted above, a first part can be above or below a second part with one or more of: other parts therebetween, without other parts therebetween, with the first and second parts touching, or without the first and second parts being in direct contact with one another. As used herein, connection references (e.g., attached, coupled, connected, and joined) may include intermediate members between the elements referenced by the connection reference and/or relative movement between those elements unless otherwise indicated. As such, connection references do not necessa