RU-2026109233-A - Feature selection using feature distribution in different subsets for machine learning in computer security applications

Inventors

SMEU Stefan
BURCEANU Elena
HALLER Emanuela

Assignees

BITDEFENDER IPR MANAGEMENT LTD

Dates

Publication Date: 2026-05-07
Application Date: 2024-09-11
Priority Date: 2023-12-30

Claims (20)

1. A computer system comprising at least one hardware processor configured to:
select a reduced subset of features from a set of features available for characterizing data samples, wherein selecting the reduced subset of features comprises:
partitioning a set of data samples obtained from a plurality of computing devices into a plurality of training corpora,
selecting a candidate feature from the set of features,
determining a first frequency distribution of feature values of the candidate feature over members of a first training corpus of the plurality of training corpora,
determining a second frequency distribution of feature values of the candidate feature over members of a second training corpus of the plurality of training corpora, and
determining whether to include the candidate feature in the reduced subset of features according to a similarity between the first and second frequency distributions; and
in response to selecting the reduced subset of features, train a threat detector to determine whether a target data sample is indicative of a computer security threat according to the reduced subset of features.
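The selection steps of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the sample layout (dictionaries), the names `partition`, `frequency_distribution`, `similarity` and `select_features`, the similarity measure (one minus total variation distance), the threshold value, and the rule that a feature is kept when its distributions are similar are all assumptions made for illustration. The claim only requires that the decision depend on the similarity.

```python
from collections import Counter, defaultdict

def partition(samples, key):
    """Split samples into training corpora by a sample attribute (hypothetical key)."""
    corpora = defaultdict(list)
    for sample in samples:
        corpora[sample[key]].append(sample)
    return dict(corpora)

def frequency_distribution(corpus, feature):
    """Relative frequency of each value of `feature` within one corpus."""
    counts = Counter(sample[feature] for sample in corpus)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def similarity(p, q):
    """Assumed measure: 1 - total variation distance (1.0 means identical)."""
    values = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)

def select_features(samples, features, corpus_key, threshold=0.8):
    """Keep features whose distributions in the first two corpora are similar.

    Both the threshold and the 'keep if similar' rule are illustrative assumptions.
    """
    corpora = list(partition(samples, corpus_key).values())
    first, second = corpora[0], corpora[1]
    selected = []
    for feature in features:
        p = frequency_distribution(first, feature)
        q = frequency_distribution(second, feature)
        if similarity(p, q) >= threshold:
            selected.append(feature)
    return selected

# Hypothetical samples collected from devices in two regions.
samples = [
    {"region": "eu", "uses_tls": 1, "port_bucket": "low"},
    {"region": "eu", "uses_tls": 0, "port_bucket": "low"},
    {"region": "us", "uses_tls": 1, "port_bucket": "high"},
    {"region": "us", "uses_tls": 0, "port_bucket": "high"},
]
reduced = select_features(samples, ["uses_tls", "port_bucket"], "region")
```

In this toy data, `uses_tls` has the same distribution in both corpora, while `port_bucket` differs completely between them. The reduced subset would then be passed to whatever training procedure is used for the threat detector.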
2. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a location of the computing device providing each respective data sample.
3. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to an identity of a user of the computing device providing each respective data sample.
4. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a time of receipt of each respective data sample.
5. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a device type of the computing device providing each respective data sample.
6. The computer system of claim 1, wherein the plurality of computing devices are distributed among a plurality of corporate owners, and wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to the owner of the computing device providing each respective data sample.
7. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a software profile of the computing device providing each respective data sample, wherein the software profile comprises a set of computer programs installed for execution on the respective computing device.
8. The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features further comprises:
determining a set of similarity measures, wherein each similarity measure of the set quantifies a similarity between a pair of frequency distributions of values of the candidate feature, wherein each frequency distribution of the pair is evaluated over a separate corpus of the plurality of training corpora; and
selecting the candidate feature into the reduced subset of features according to an average of the set of similarity measures.
9. The computer system of claim 8, wherein the at least one hardware processor is configured to select the candidate feature into the reduced subset of features further according to a variance of the set of similarity measures.
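As a hedged illustration of claims 8 and 9, the sketch below computes one similarity measure per pair of corpora and decides on the candidate feature from the average and variance of those measures. The function names, the similarity measure (one minus total variation distance), both thresholds, and the way the two statistics are combined are assumptions; the claims do not specify them. At least two corpora are needed so that at least one pair exists.

```python
from itertools import combinations
from statistics import mean, pvariance

def similarity(p, q):
    """Assumed measure: 1 - total variation distance (1.0 means identical)."""
    values = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)

def pairwise_similarity_stats(distributions):
    """Mean and population variance of similarities over all corpus pairs."""
    sims = [similarity(p, q) for p, q in combinations(distributions, 2)]
    return mean(sims), pvariance(sims)

def keep_feature(distributions, min_mean=0.8, max_variance=0.01):
    """Illustrative rule: keep a feature that is similar on average and
    consistently so across corpus pairs. Thresholds are hypothetical."""
    avg, var = pairwise_similarity_stats(distributions)
    return avg >= min_mean and var <= max_variance

# One frequency distribution of the candidate feature per training corpus.
stable = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]
drifting = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0}]
keep_stable = keep_feature(stable)
keep_drifting = keep_feature(drifting)
```

Using the variance alongside the average separates a feature that is moderately similar everywhere from one that matches in some corpus pairs and diverges sharply in others.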
10. The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features further comprises:
for each feature of the set of features, determining a feature-specific frequency distribution of feature values of the respective feature over the first training corpus;