KR-20260066706-A - Feature selection using feature distributions from different subsets for machine learning in computer security applications
Abstract
A diverse collection of data samples harvested for computer security applications is divided into multiple training corpora according to a criterion such as the identity of the data source. An initial feature set intended to characterize the collected data samples is then reduced to a smaller subset. For each candidate feature, the frequency distribution of the feature's values is determined within each training corpus. The feature selection process favors features whose frequency distributions are relatively similar across multiple corpora. A detector module is then trained to detect computer security threats based on the reduced feature set.
Inventors
- Smeu, Stefan
- Burceanu, Elena
- Haller, Emanuela
Assignees
- Bitdefender IPR Management Ltd
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-09-11
- Priority Date: 2023-12-30
Claims (20)
- A computer system comprising at least one hardware processor configured to: select a reduced subset of features from a plurality of features available for characterizing a data sample, wherein selecting the reduced subset of features comprises: dividing a set of data samples obtained from a plurality of computing devices into a plurality of training corpora; selecting a candidate feature from the plurality of features; determining a first frequency distribution of feature values of the candidate feature across members of a first training corpus of the plurality of training corpora; determining a second frequency distribution of feature values of the candidate feature across a second training corpus of the plurality of training corpora; and determining whether to include the candidate feature in the reduced subset of features according to a similarity between the first and second frequency distributions; and in response to selecting the reduced subset of features, train a threat detector to determine whether a target data sample represents a computer security threat according to the reduced subset of features.
- The computer system of claim 1, wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to a location of the computing device providing each respective data sample.
- The computer system of claim 1, wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to an identity of a user of the computing device providing each respective data sample.
- The computer system of claim 1, wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to an acquisition time of each respective data sample.
- The computer system of claim 1, wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to a device type of the computing device providing each respective data sample.
- The computer system of claim 1, wherein the plurality of computing devices is divided among a plurality of corporate owners, and wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to the owner of the computing device providing each respective data sample.
- The computer system of claim 1, wherein the at least one hardware processor is configured to divide the set of data samples into the plurality of training corpora according to a software profile of the computing device providing each respective data sample, the software profile comprising a set of computer programs installed for execution on the respective computing device.
- The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features comprises: determining a plurality of similarity measures, wherein each of the plurality of similarity measures quantifies a similarity between a pair of frequency distributions of feature values of the candidate feature, each distribution of the pair being evaluated across a separate corpus of the plurality of training corpora; and selecting the candidate feature into the reduced subset of features according to an average of the plurality of similarity measures.
- The computer system of claim 8, wherein the at least one hardware processor is further configured to select the candidate feature into the reduced subset of features according to a dispersion of the plurality of similarity measures.
- The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features further comprises: for each feature of the plurality of features, evaluating a feature-specific frequency distribution of feature values of the respective feature across the first training corpus; ranking the plurality of features according to the evaluated feature-specific frequency distributions; and selecting the candidate feature into the reduced subset of features according to a result of the ranking.
- The computer system of claim 1, wherein the plurality of features is automatically configured by a machine learning procedure comprising training another threat detector to identify members of the set of data samples representing the computer security threat.
- A computer security method comprising employing at least one hardware processor of a computer system to: select a reduced subset of features from a plurality of features available for characterizing a data sample, wherein selecting the reduced subset of features comprises: dividing a set of data samples obtained from a plurality of computing devices into a plurality of training corpora; selecting a candidate feature from the plurality of features; determining a first frequency distribution of feature values of the candidate feature across members of a first training corpus of the plurality of training corpora; determining a second frequency distribution of feature values of the candidate feature across a second training corpus of the plurality of training corpora; and determining whether to include the candidate feature in the reduced subset of features according to a similarity between the first and second frequency distributions; and in response to selecting the reduced subset of features, train a threat detector to determine whether a target data sample represents a computer security threat according to the reduced subset of features.
- The computer security method of claim 12, comprising dividing the set of data samples into the plurality of training corpora according to a location of the computing device providing each respective data sample.
- The computer security method of claim 12, comprising dividing the set of data samples into the plurality of training corpora according to an identity of a user of the computing device providing each respective data sample.
- The computer security method of claim 12, comprising dividing the set of data samples into the plurality of training corpora according to an acquisition time of each respective data sample.
- The computer security method of claim 12, comprising dividing the set of data samples into the plurality of training corpora according to a device type of the computing device providing each respective data sample.
- The computer security method of claim 12, comprising dividing the set of data samples into the plurality of training corpora according to a software profile of the computing device providing each respective data sample, the software profile comprising a set of computer programs installed for execution on the respective computing device.
- The computer security method of claim 12, wherein the plurality of computing devices is divided among a plurality of corporate owners, and wherein the method comprises dividing the set of data samples into the plurality of training corpora according to the owner of the computing device providing each respective data sample.
- The computer security method of claim 12, wherein determining whether to include the candidate feature in the reduced subset of features comprises: determining a plurality of similarity measures, wherein each of the plurality of similarity measures quantifies a similarity between a pair of frequency distributions of feature values of the candidate feature, each distribution of the pair being evaluated across a separate corpus of the plurality of training corpora; and selecting the candidate feature into the reduced subset of features according to an average of the plurality of similarity measures.
- The computer security method of claim 19, further comprising selecting the candidate feature into the reduced subset of features according to a dispersion of the plurality of similarity measures.
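The selection procedure recited in the claims can be illustrated with a minimal Python sketch. It is not the implementation described in this publication: the similarity measure (one minus the total variation distance), the thresholds `min_mean` and `max_spread`, the sample layout, and all function and feature names are assumptions chosen for illustration. The sketch partitions samples into corpora by data source, computes per-corpus frequency distributions of each candidate feature, and keeps a feature only when the pairwise similarities have a high average and a low dispersion.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean, pstdev


def split_into_corpora(samples, key):
    """Group feature vectors into training corpora by a partition criterion."""
    corpora = defaultdict(list)
    for sample in samples:
        corpora[sample[key]].append(sample["features"])
    return dict(corpora)


def frequency_distribution(corpus, feature):
    """Relative frequency of each value of `feature` within one corpus."""
    counts = Counter(vector[feature] for vector in corpus)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def similarity(p, q):
    """Assumed measure: 1 - total variation distance (1.0 means identical)."""
    support = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def score_feature(corpora, feature):
    """Mean and dispersion of pairwise similarities (needs at least 2 corpora)."""
    dists = [frequency_distribution(c, feature) for c in corpora.values()]
    sims = [similarity(p, q) for p, q in combinations(dists, 2)]
    return mean(sims), pstdev(sims)


def select_features(corpora, features, min_mean=0.8, max_spread=0.1):
    """Keep features whose distributions look alike across corpora.

    The two thresholds are hypothetical tuning parameters.
    """
    selected = []
    for feature in features:
        avg, spread = score_feature(corpora, feature)
        if avg >= min_mean and spread <= max_spread:
            selected.append(feature)
    return selected


# Toy data: "uses_tls" is distributed identically in every corpus, whereas
# "locale_id" takes a different value for each data source.
samples = [
    {"source": "A", "features": {"uses_tls": 1, "locale_id": 0}},
    {"source": "A", "features": {"uses_tls": 0, "locale_id": 0}},
    {"source": "B", "features": {"uses_tls": 1, "locale_id": 1}},
    {"source": "B", "features": {"uses_tls": 0, "locale_id": 1}},
    {"source": "C", "features": {"uses_tls": 1, "locale_id": 2}},
    {"source": "C", "features": {"uses_tls": 0, "locale_id": 2}},
]
corpora = split_into_corpora(samples, key="source")
selected = select_features(corpora, ["uses_tls", "locale_id"])
print(selected)  # prints ['uses_tls']
```

In this toy run the source-specific feature `locale_id` is rejected because its distributions do not overlap between corpora, while `uses_tls` is retained. Changing the `key` argument to another hypothetical field (location, user, acquisition time, device type, owner, or software profile) would correspond to the alternative partition criteria listed in the dependent claims.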
Description
Feature selection using feature distributions from different subsets for machine learning in computer security applications
Cross-reference to related applications
This application claims the benefit of the filing date of United States Provisional Patent Application No. 63/582,278 (titled “Feature Selection for Robust Anomaly Detectors”) filed on September 13, 2023, the entire contents of which are incorporated herein by reference.
The present invention relates to machine learning, and more particularly to the optimal selection of input features for a classifier used in computer security applications, including the detection of malicious software, intrusions, and online fraud.
Computer security is a major field of information technology aimed at protecting users and computing appliances from malicious software, intrusions, and fraudulent use. Malicious software (malware), in many forms such as computer viruses, spyware, and ransomware, affects millions of devices, making them vulnerable to fraud, loss of data and sensitive information, identity theft, and loss of productivity, among other things.
Another persistent threat stems from online fraud, particularly in the form of phishing and identity theft. Sensitive identity information, such as usernames, IDs, passwords, social security and medical records, and bank and credit card details, fraudulently obtained by international criminal networks operating on the internet, is used to withdraw personal funds and/or is sold to third parties. In addition to direct financial damage to individuals, online fraud also causes various negative side effects on the economy, such as increased corporate security costs, higher retail prices and bank fees, falling stock prices, lower wages, and reduced tax revenue.
The explosive growth of mobile computing has only exacerbated computer security risks, as millions of devices such as smartphones and tablet computers are constantly connected to the internet, becoming potential targets for malware and fraud attempts.
Various computer security methods and software can be used to protect users and computers from these threats. Modern systems and methods increasingly rely on pre-trained artificial intelligence (AI) to distinguish between malicious and benign samples. A typical example of an AI-based malware detector includes a neural network configured to receive a vector of feature values characterizing an input sample and to generate an output indicating whether the respective sample is malicious.
However, AI-based methods face significant technical challenges of their own. One example is the selection of input features. Generally, there are no clear or universal criteria for selecting software features that are more likely to reveal malice and/or distinguish between malicious and benign behavior. Furthermore, there is no single set of malware-indicative features that operates reliably across a highly heterogeneous collection of devices, such as desktop computers, mobile computing platforms (smartphones, wearables, etc.), and Internet of Things (IoT) appliances.
The problem is further complicated by deliberate attempts by sophisticated malware to evade detection. Malware can tailor its behavior to the device type (e.g., smartphone vs. tablet, one manufacturer or model vs. another), the operating system type, and the current geographic location of each device. Some malware further selects its victims by scanning each device for indicators of the user's value to the attacker. For example, malware can determine what other software is currently installed on each device and search for specific applications such as banking or social media.
Other malware can monitor patterns of user access to various applications, online resources, and the like. Such malware can then launch attacks only against carefully selected devices and/or carefully selected users, at times when each attack appears more likely to succeed.
To address the variability of malware and the heterogeneity of host devices, some conventional approaches substantially increase the number of features, and consequently the size of the AI model, in an attempt to improve performance. However, large-scale neural networks are notoriously expensive to implement and train, and generally require massive training corpora that are difficult to acquire, annotate, and maintain.
Another common approach uses unsupervised training, where the AI system is configured to construct its own feature set based on the available training corpus. However, these self-generated features are generally uninformative to human users, and there is no guarantee that they will perform as expected when applied to data samples that the AI system has not previously seen.
The foregoing aspects and advantages of the present invention will be better understood by reading the following detailed description and referring to the following drawings: FIG. 1 shows a plurality of client devices protected from co