RU-2026109233-A - Feature selection using feature distribution in different subsets for machine learning in computer security applications
Inventors
- СМЕУ Стефан
- БУРЧАНУ Елена
- ХАЛЛЕР Эмануэла
Assignees
- БИТДЕФЕНДЕР АйПиАр МЕНЕДЖМЕНТ ЛТД
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-09-11
- Priority Date: 2023-12-30
Claims (20)
- 1. A computer system comprising at least one hardware processor configured to:
- select a reduced subset of features from a set of features available for characterizing data samples, wherein selecting the reduced subset of features comprises:
- partitioning a set of data samples obtained from a plurality of computing devices into a plurality of training corpora,
- selecting a candidate feature from the set of features,
- determining a first frequency distribution of feature values of the candidate feature relative to members of a first training corpus of the plurality of training corpora,
- determining a second frequency distribution of feature values of the candidate feature relative to a second training corpus of the plurality of training corpora, and
- determining whether to include the candidate feature in the reduced subset of features according to a similarity between the first and second frequency distributions; and
- in response to selecting the reduced subset of features, train a threat detector to determine, according to the reduced subset of features, whether a target data sample is indicative of a computer security threat.
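The selection steps of claim 1 can be illustrated with a minimal Python sketch. Several things here are assumptions and do not come from the claims: the sample layout (a dict with a `features` mapping plus metadata), the function names, the choice of similarity measure (one minus the total variation distance), the threshold value, and the rule that a feature is kept when its distributions are similar. The claim states only that the decision is made "according to" the similarity. Training the threat detector is omitted.

```python
from collections import Counter, defaultdict


def partition(samples, key):
    """Group samples into training corpora by a metadata key (hypothetical layout)."""
    corpora = defaultdict(list)
    for sample in samples:
        corpora[sample[key]].append(sample)
    return dict(corpora)


def frequency_distribution(corpus, feature):
    """Relative frequency of each value of `feature` among the corpus members."""
    counts = Counter(sample["features"][feature] for sample in corpus)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def similarity(p, q):
    """Assumed measure: 1 - total variation distance (1.0 identical, 0.0 disjoint)."""
    values = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


def select_features(samples, features, key, threshold=0.8):
    """Keep features whose distributions in the first two corpora are similar.

    The keep-if-similar rule and the threshold are assumptions for illustration.
    """
    corpora = list(partition(samples, key).values())
    first, second = corpora[0], corpora[1]
    selected = []
    for feature in features:
        p = frequency_distribution(first, feature)
        q = frequency_distribution(second, feature)
        if similarity(p, q) >= threshold:
            selected.append(feature)
    return selected


# Toy data: "uses_tls" is distributed identically at both locations,
# while "port_bucket" takes different values at each location.
SAMPLES = [
    {"location": "A", "features": {"uses_tls": 1, "port_bucket": "low"}},
    {"location": "A", "features": {"uses_tls": 0, "port_bucket": "low"}},
    {"location": "B", "features": {"uses_tls": 1, "port_bucket": "high"}},
    {"location": "B", "features": {"uses_tls": 0, "port_bucket": "high"}},
]
REDUCED = select_features(SAMPLES, ["uses_tls", "port_bucket"], "location")
```

In this toy run `REDUCED` contains only `uses_tls`. Passing a different metadata key to `partition` (user, receipt time bucket, device type, owner, software profile) would correspond to the partitioning criteria recited in the dependent claims.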
- 2. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a location of the computing device providing each respective data sample.
- 3. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to an identity of a user of the computing device providing each respective data sample.
- 4. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a time of receipt of each respective data sample.
- 5. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a device type of the computing device providing each respective data sample.
- 6. The computer system of claim 1, wherein the plurality of computing devices are divided among a plurality of corporate owners, and wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to the owner of the computing device providing each respective data sample.
- 7. The computer system of claim 1, wherein the at least one hardware processor is configured to partition the set of data samples into the plurality of training corpora according to a software profile of the computing device providing each respective data sample, wherein the software profile comprises a set of computer programs installed for execution on the respective computing device.
- 8. The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features further comprises:
- determining a plurality of similarity measures, wherein each similarity measure of the plurality quantifies a similarity between a pair of frequency distributions of values of the candidate feature, wherein each frequency distribution of the pair is evaluated relative to a distinct corpus of the plurality of training corpora; and
- selecting the candidate feature into the reduced subset of features according to an average of the plurality of similarity measures.
- 9. The computer system of claim 8, wherein the at least one hardware processor is configured to select the candidate feature into the reduced subset of features further according to a variance of the plurality of similarity measures.
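Claims 8 and 9 aggregate pairwise similarities across corpora. A minimal sketch of that aggregation follows, assuming each corpus has already been reduced to a frequency distribution (a dict mapping feature value to relative frequency). The similarity measure, both thresholds, and the rule of requiring a high mean together with a low variance are assumptions; the claims state only that selection is made according to the average and the variance.

```python
from itertools import combinations
from statistics import mean, pvariance


def similarity(p, q):
    """Assumed measure: 1 - total variation distance between two distributions."""
    values = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


def pairwise_similarity_stats(distributions):
    """Mean and population variance of similarity over all pairs of corpora.

    Requires at least two distributions (one per training corpus).
    """
    sims = [similarity(p, q) for p, q in combinations(distributions, 2)]
    return mean(sims), pvariance(sims)


def keep_feature(distributions, min_mean=0.8, max_variance=0.01):
    """Hypothetical decision rule: similar on average and consistently so."""
    avg, var = pairwise_similarity_stats(distributions)
    return avg >= min_mean and var <= max_variance


# One candidate feature evaluated on three corpora.
STABLE = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
SHIFTED = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}, {0: 1.0}]
```

With these toy inputs, `keep_feature(STABLE)` holds while `keep_feature(SHIFTED)` does not, because the third corpus pulls the average similarity down to 2/3.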
- 10. The computer system of claim 1, wherein determining whether to include the candidate feature in the reduced subset of features further comprises:
- for each feature of the set of features, estimating a feature-specific frequency distribution of feature values of the respective feature relative to the first training corpus;