US-20260129074-A1 - MACHINE LEARNING ARCHITECTURE FOR DETECTING MALICIOUS FILES USING STREAM OF DATA
Abstract
The present application discloses a method, system, and computer system for classifying stream data at an edge device. The method includes obtaining a stream of a file at the edge device, aligning a predetermined amount of data in chunks associated with the stream of the file, processing a plurality of aligned chunks associated with the stream of the file using a machine learning model, and classifying, at the edge device, the file based at least in part on a classification of the plurality of aligned chunks.
Inventors
- William Redington Hewlett II
- Sujit Rokka Chhetri
- Brody James Kutt
- Shan Huang
- Nandini Ramanan
- Sheng Yang
- Min Du
Assignees
- PALO ALTO NETWORKS, INC.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-08-01
Claims (20)
- 1. A system for performing classification at an edge device, comprising: one or more processors configured to: obtain a stream of a file at the edge device; obtain a set of chunks associated with the stream of the file; obtain a set of aligned chunks based at least in part on determining a set of file segments for the file, wherein at least one file segment of the set of file segments comprises information from at least two successive chunks of the set of chunks; and classify, at the edge device, the file based at least in part on a classification of one or more of the set of aligned chunks; and a memory coupled to the one or more processors and configured to provide the one or more processors with instructions.
- 2. The system of claim 1, wherein the edge device is a network device.
- 3. The system of claim 1, wherein the edge device is an inline security entity.
- 4. The system of claim 1, wherein the classification of the set of aligned chunks is obtained based at least in part on querying a machine learning model.
- 5. The system of claim 4, wherein the machine learning model is configured to classify whether the file is malicious.
- 6. The system of claim 4, wherein the file is classified using the machine learning model before an entirety of the file is processed.
- 7. The system of claim 1, wherein a first chunk of the set of chunks comprises overhead associated with the file.
- 8. The system of claim 1, wherein: determining a set of file segments for the file comprises: determining a first file segment based at least in part on associating a predetermined amount of a first chunk with a predetermined amount of a second chunk; classifying the file comprises querying the machine learning model based on the first file segment; and the file is classified based at least in part on a classification of the first file segment.
- 9. The system of claim 1, wherein: determining a set of file segments for the file comprises: determining an nth file segment based at least in part on associating a predetermined amount of an ith chunk with a predetermined amount of a jth chunk; i and j are positive integers, and j is greater than i; classifying the file comprises querying the machine learning model based on the nth file segment; and the file is classified based at least in part on a classification of the nth file segment.
- 10. The system of claim 9, wherein the nth file segment comprises a predetermined number of bytes.
- 11. The system of claim 9, wherein the nth file segment comprises 1500 bytes.
- 12. The system of claim 9, wherein the one or more processors are configured to select the predetermined number of bytes from among a set of preset numbers of bytes, the predetermined number of bytes being selected based on a packet size of the file.
- 13. The system of claim 1, wherein obtain a set of aligned chunks based on the set of chunks for the stream of the file adjusts for file overhead in a first chunk of the set of chunks and ensures that classification comprises processing a same number of bytes of the file for each alignment-adjusted chunk.
- 14. The system of claim 1, wherein the file is determined to be malicious if a prediction obtained from a machine learning model exceeds a predefined malicious threshold.
- 15. The system of claim 14, wherein the file is determined to be malicious after an nth file segment is processed using the machine learning model, n corresponds to a positive integer that is less than a total number of chunks in the file.
- 16. The system of claim 14, wherein the predefined malicious threshold is constant for each file segment in the file.
- 17. The system of claim 14, wherein the predefined malicious threshold is dynamic across classification of file segments in the file.
- 18. The system of claim 14, wherein in response to determining that the file is malicious, an active measure for malicious files is implemented.
- 19. A method for performing classification at an edge device, comprising: obtaining, by one or more processors, a stream of a file at the edge device; obtaining a set of chunks associated with the stream of the file; obtaining a set of aligned chunks based at least in part on determining a set of file segments for the file, wherein at least one file segment of the set of file segments comprises information from at least two successive chunks of the set of chunks; and classifying, at the edge device, the file based at least in part on a classification of one or more of the set of aligned chunks.
- 20. A computer program product embodied in a non-transitory computer readable medium for performing classification at an edge device, and the computer program product comprising computer instructions for: obtaining, by one or more processors, a stream of a file at the edge device; obtaining a set of chunks associated with the stream of the file; obtaining a set of aligned chunks based at least in part on determining a set of file segments for the file, wherein at least one file segment of the set of file segments comprises information from at least two successive chunks of the set of chunks; classifying, at the edge device, the file based at least in part on a classification of one or more of the set of aligned chunks.
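The claims above describe re-cutting stream chunks into fixed-size file segments that may span successive chunks, then classifying the file as soon as one segment's prediction exceeds a malicious threshold. The following is a minimal illustrative sketch of that idea, not the applicant's implementation. The names `align_chunks`, `classify_stream`, and `score_segment`, the choice to skip the overhead bytes, and the handling of a trailing partial segment are all assumptions made for illustration; the 1500-byte segment size is the value recited in claim 11, and `score_segment` stands in for whatever machine learning model is queried.

```python
from typing import Callable, Iterable, Iterator, Tuple

SEGMENT_SIZE = 1500  # bytes per file segment (the value recited in claim 11)


def align_chunks(chunks: Iterable[bytes], overhead: int = 0,
                 segment_size: int = SEGMENT_SIZE) -> Iterator[bytes]:
    """Re-cut arbitrarily sized chunks into fixed-size file segments.

    `overhead` is a hypothetical count of leading non-file bytes to skip;
    skipping them is one possible reading of "adjusts for file overhead".
    """
    buffer = bytearray()
    to_skip = overhead
    for chunk in chunks:
        if to_skip:
            skipped = min(to_skip, len(chunk))
            chunk = chunk[skipped:]
            to_skip -= skipped
        buffer.extend(chunk)
        # Emit every complete segment; a segment may contain bytes
        # from two or more successive chunks.
        while len(buffer) >= segment_size:
            yield bytes(buffer[:segment_size])
            del buffer[:segment_size]
    if buffer:
        # Assumption: a trailing partial segment is emitted unpadded.
        yield bytes(buffer)


def classify_stream(chunks: Iterable[bytes],
                    score_segment: Callable[[bytes], float],
                    threshold: float = 0.5,
                    overhead: int = 0,
                    segment_size: int = SEGMENT_SIZE) -> Tuple[bool, int]:
    """Return (is_malicious, segments_processed), stopping at the first
    segment whose score exceeds a constant threshold."""
    processed = 0
    for segment in align_chunks(chunks, overhead, segment_size):
        processed += 1
        if score_segment(segment) > threshold:
            return True, processed  # verdict before the whole file is seen
    return False, processed


if __name__ == "__main__":
    # Three 1000-byte payloads; the first chunk carries 10 overhead bytes.
    stream = [b"H" * 10 + b"a" * 1000, b"b" * 1000, b"c" * 1000]
    # Stub scorer (hypothetical): flags any segment containing the byte "b".
    print(classify_stream(stream, lambda s: 1.0 if b"b" in s else 0.0,
                          overhead=10))
```

Because `align_chunks` is a generator, `classify_stream` pulls chunks from the stream only as needed, so a verdict on an early segment ends processing without reading the remainder of the file. A per-segment (dynamic) threshold, as in claim 17, could be modeled by passing a callable in place of the constant `threshold`.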
Description
CROSS REFERENCE TO OTHER APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/104,137, entitled MACHINE LEARNING ARCHITECTURE FOR DETECTING MALICIOUS FILES USING STREAM OF DATA, filed Jan. 31, 2023, which is incorporated herein by reference for all purposes.
BACKGROUND OF THE INVENTION
Nefarious individuals attempt to compromise computer systems in a variety of ways. As one example, such individuals may embed or otherwise include malicious files in email attachments and transmit or cause the malicious files to be transmitted to unsuspecting users. When executed, the malicious files compromise the victim's computer. Some types of malicious files will instruct a compromised computer to communicate with a remote host. For example, malicious files can turn a compromised computer into a “bot” in a “botnet,” receiving instructions from and/or reporting data to a command and control (C&C) server under the control of the nefarious individual. One approach to mitigating the damage caused by malicious files is for a security company (or other appropriate entity) to attempt to identify a malicious file and prevent it from reaching/executing on end user computers. Another approach is to try to prevent compromised computers from communicating with the C&C server. Unfortunately, authors of malicious files are using increasingly sophisticated techniques to obfuscate the workings of their software. Accordingly, there exists an ongoing need for improved techniques to detect malware and prevent its harm.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram of an environment in which malicious traffic is detected or suspected according to various embodiments.
FIG. 2 is a block diagram of a system to classify a file according to various embodiments.
FIG. 3 is a block diagram of a method for classifying a model.
FIG. 4 illustrates a system for classifying a streaming file based on a subset of chunks of the streaming file according to various embodiments.
FIG. 5 illustrates a system for classifying a streaming file based on a subset of chunks of the streaming file according to various embodiments.
FIG. 6 illustrates a graph of performance of file classification using a subset of chunks of a streaming file according to various embodiments.
FIG. 7 is a flow diagram of a method for classifying a streaming file before processing the entirety of the streaming file according to various embodiments.
FIG. 8 is a flow diagram of a method for classifying a streaming file before processing the entirety of the streaming file according to various embodiments.
FIG. 9 is a flow diagram of a method for classifying a streaming file before processing the entirety of the streaming file according to various embodiments.
FIG. 10 is a flow diagram of a method for training a classification model according to various embodiments.
FIG. 11 is a diagram of a set of chunks associated with a streaming file according to various embodiments.
FIG. 12 is a flow diagram of a method for detecting a malicious file according to various embodiments.
FIG. 13 is a block diagram of a system for classifying a streaming file based on chunk data according to various embodiments.
FIG. 14 is a block diagram of classification of a set of chunks obtained in a stream of a file according to various embodiments.
FIG. 15 is a block diagram of classification of a set of chunks obtained in a stream of a file according to various embodiments.
FIG. 16 is a flow diagram of a method for classifying a stream of a file according to various embodiments.
FIG. 17 is a flow diagram of a method for detecting a malicious file according to various embodiments.
FIG. 18 is a flow diagram of a method for detecting a malicious file according to various embodiments.
FIG. 19 is a flow diagram of a method for detecting a malicious file according to various embodiments.
DETAILED DESCRIPTION
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores