US-20260127427-A1 - RETRAINING A CLASSIFIER MACHINE LEARNING MODEL TO IMPROVE ACCURACY OF DETECTION

US 20260127427 A1

Abstract

Provided are a computer implemented method, system, and computer program product for retraining a classifier machine learning model to improve accuracy of detection. A feature vector is received that is classified by the classifier as having a first classification result. A determination is made of a second classification result for the received feature vector based on labeled feature vectors having labeled classification results. The classifier is retrained to output the second classification result from input comprising the received feature vector in response to determining that the first classification result is different from the second classification result.

Inventors

Roman Alexander Pletka
Dionysios Diamantopoulos
Charalampos Pozidis
Yves Alexandre Beraldo dos Santos
Andrew D. Walls

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 20260507
Application Date: 20241104

Claims (20)

1. A computer implemented method for retraining a classifier comprising a machine learning model, comprising: receiving a feature vector classified by the classifier having a first classification result; determining a second classification result for the received feature vector based on labeled feature vectors having labeled classification results; and retraining the classifier to output the second classification result from input comprising the received feature vector in response to determining that the first classification result is different from the second classification result.
2. The computer implemented method of claim 1, wherein the determining the second classification result based on the labeled feature vectors comprises: clustering labeled feature vectors to determine a representative labeled feature vector of a cluster of the labeled feature vectors; determining a similarity score between the representative labeled feature vector and the received feature vector; and determining whether the similarity score exceeds a similarity threshold indicating a degree of similarity, wherein the second classification result comprises a label of the representative labeled feature vector.
3. The computer implemented method of claim 2, wherein the determining the similarity score comprises measuring a distance between the representative labeled feature vector and the received feature vector in a vector space.
4. The computer implemented method of claim 1, further comprising: inputting the received feature vector through a plurality of classifiers to determine classification results; selecting at least one of the classifiers; forming a filtered set of feature vectors and classification results from the selected at least one classifier, wherein the second classification result is determined for the feature vectors in the filtered set; and adding the feature vectors in the filtered set having the first classification result different from the second classification result to a training set, wherein the retraining the classifier to output the second classification result is performed for the feature vectors in the training set.
5. The computer implemented method of claim 1, wherein the determining the second classification result based on the labeled feature vectors comprises: providing a base classifier model trained with a labeled training set to classify an input feature vector to a classification result, wherein the base classifier model comprises a more extensive neural network requiring greater memory and computational resources than the classifier, wherein the second classification result is outputted by the base classifier model.
6. The computer implemented method of claim 1, wherein the determining the second classification result based on the labeled feature vectors comprises: using retrieval augmented classification to determine an augmented labeled feature vector in a vector database most similar to the received feature vector; and providing a base classifier model trained with a labeled training set to classify an input feature vector and the augmented labeled feature vector as a classification result, wherein the second classification result is outputted by the base classifier model.
7. The computer implemented method of claim 1, wherein the determining the second classification result based on the labeled feature vectors comprises: embedding the received feature vector into an embedded feature vector in a vector space; embedding the labeled feature vectors into embedded labeled feature vectors; inputting the embedded feature vector and the labeled feature vectors into an unsupervised machine learning model to perform clustering to determine an embedded labeled feature vector closest to the embedded feature vector; and determining whether a distance between the embedded feature vector and the closest embedded labeled feature vector is within a distance threshold, wherein the second classification result comprises a label of the closest embedded labeled feature vector.
8. A system for retraining a classifier comprising a machine learning model, comprising: a processor; and a computer readable storage medium including program instructions that when executed by the processor causes operations, the operations comprising: receiving a feature vector classified by the classifier having a first classification result; determining a second classification result for the received feature vector based on labeled feature vectors having labeled classification results; and retraining the classifier to output the second classification result from input comprising the received feature vector in response to determining that the first classification result is different from the second classification result.
9. The system of claim 8, wherein the determining the second classification result based on the labeled feature vectors comprises: clustering labeled feature vectors to determine a representative labeled feature vector of a cluster of the labeled feature vectors; determining a similarity score between the representative labeled feature vector and the received feature vector; and determining whether the similarity score exceeds a similarity threshold indicating a degree of similarity, wherein the second classification result comprises a label of the representative labeled feature vector.
10. The system of claim 9, wherein the determining the similarity score comprises measuring a distance between the representative labeled feature vector and the received feature vector in a vector space.
11. The system of claim 8, further comprising: inputting the received feature vector through a plurality of classifiers to determine classification results; selecting at least one of the classifiers; forming a filtered set of feature vectors and classification results from the selected at least one classifier, wherein the second classification result is determined for the feature vectors in the filtered set; and adding the feature vectors in the filtered set having the first classification result different from the second classification result to a training set, wherein the retraining the classifier to output the second classification result is performed for the feature vectors in the training set.
12. The system of claim 8, wherein the determining the second classification result based on the labeled feature vectors comprises: providing a base classifier model trained with a labeled training set to classify an input feature vector to a classification result, wherein the base classifier model comprises a more extensive neural network requiring greater memory and computational resources than the classifier, wherein the second classification result is outputted by the base classifier model.
13. The system of claim 8, wherein the determining the second classification result based on the labeled feature vectors comprises: using retrieval augmented classification to determine an augmented labeled feature vector in a vector database most similar to the received feature vector; and providing a base classifier model trained with a labeled training set to classify an input feature vector and the augmented labeled feature vector as a classification result, wherein the second classification result is outputted by the base classifier model.
14. The system of claim 8, wherein the determining the second classification result based on the labeled feature vectors comprises: embedding the received feature vector into an embedded feature vector in a vector space; embedding the labeled feature vectors into embedded labeled feature vectors; inputting the embedded feature vector and the labeled feature vectors into an unsupervised machine learning model to perform clustering to determine an embedded labeled feature vector closest to the embedded feature vector; and determining whether a distance between the embedded feature vector and the closest embedded labeled feature vector is within a distance threshold, wherein the second classification result comprises a label of the closest embedded labeled feature vector.
15. A computer program product for retraining a classifier comprising a machine learning model, comprising a computer readable storage medium including program instructions that when executed by a processor perform operations, the operations comprising: receiving a feature vector classified by the classifier having a first classification result; determining a second classification result for the received feature vector based on labeled feature vectors having labeled classification results; and retraining the classifier to output the second classification result from input comprising the received feature vector in response to determining that the first classification result is different from the second classification result.
16. The computer program product of claim 15, wherein the determining the second classification result based on the labeled feature vectors comprises: clustering labeled feature vectors to determine a representative labeled feature vector of a cluster of the labeled feature vectors; determining a similarity score between the representative labeled feature vector and the received feature vector; and determining whether the similarity score exceeds a similarity threshold indicating a degree of similarity, wherein the second classification result comprises a label of the representative labeled feature vector.
17. The computer program product of claim 15, wherein the operations further comprise: inputting the received feature vector through a plurality of classifiers to determine classification results; selecting at least one of the classifiers; forming a filtered set of feature vectors and classification results from the selected at least one classifier, wherein the second classification result is determined for the feature vectors in the filtered set; and adding the feature vectors in the filtered set having the first classification result different from the second classification result to a training set, wherein the retraining the classifier to output the second classification result is performed for the feature vectors in the training set.
18. The computer program product of claim 15, wherein the determining the second classification result based on the labeled feature vectors comprises: providing a base classifier model trained with a labeled training set to classify an input feature vector to a classification result, wherein the base classifier model comprises a more extensive neural network requiring greater memory and computational resources than the classifier, wherein the second classification result is outputted by the base classifier model.
19. The computer program product of claim 15, wherein the determining the second classification result based on the labeled feature vectors comprises: using retrieval augmented classification to determine an augmented labeled feature vector in a vector database most similar to the received feature vector; and providing a base classifier model trained with a labeled training set to classify an input feature vector and the augmented labeled feature vector as a classification result, wherein the second classification result is outputted by the base classifier model.
20. The computer program product of claim 15, wherein the determining the second classification result based on the labeled feature vectors comprises: embedding the received feature vector into an embedded feature vector in a vector space; embedding the labeled feature vectors into embedded labeled feature vectors; inputting the embedded feature vector and the labeled feature vectors into an unsupervised machine learning model to perform clustering to determine an embedded labeled feature vector closest to the embedded feature vector; and determining whether a distance between the embedded feature vector and the closest embedded labeled feature vector is within a distance threshold, wherein the second classification result comprises a label of the closest embedded labeled feature vector.
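
Illustrative sketch (not part of the claims): the similarity check recited in claims 2 and 3 might be approximated as below. This is a minimal sketch under stated assumptions, not the applicant's implementation. The cluster assignments are assumed to have been computed already by some clustering algorithm, the representative is taken to be the cluster centroid carrying the majority label, and the similarity score is assumed to be 1 / (1 + Euclidean distance). All function names and the threshold value are hypothetical.

```python
import math
from collections import Counter, defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def representatives(labeled, cluster_ids):
    """Build one (representative vector, label) pair per cluster.

    labeled: list of (vector, label) pairs.
    cluster_ids: parallel list of cluster ids (assumed precomputed).
    """
    groups = defaultdict(list)
    for (vec, label), cid in zip(labeled, cluster_ids):
        groups[cid].append((vec, label))
    reps = []
    for members in groups.values():
        rep_vec = centroid([vec for vec, _ in members])
        # Assumption: the representative carries the cluster's majority label.
        rep_label = Counter(label for _, label in members).most_common(1)[0][0]
        reps.append((rep_vec, rep_label))
    return reps


def similarity(a, b):
    """Assumed score: 1.0 for identical vectors, approaching 0 as distance grows."""
    return 1.0 / (1.0 + math.dist(a, b))


def second_result(vec, reps, threshold):
    """Label of the most similar representative, or None if below the threshold."""
    rep_vec, rep_label = max(reps, key=lambda rep: similarity(vec, rep[0]))
    return rep_label if similarity(vec, rep_vec) > threshold else None


labeled = [
    ([0.0, 0.0], "benign"),
    ([0.2, 0.0], "benign"),
    ([1.0, 1.0], "ransomware"),
    ([1.0, 0.8], "ransomware"),
]
cluster_ids = [0, 0, 1, 1]  # hypothetical output of a clustering step
reps = representatives(labeled, cluster_ids)
near = second_result([0.1, 0.1], reps, threshold=0.8)
far = second_result([3.0, 3.0], reps, threshold=0.8)
```

In this toy data the first query lies close to the benign centroid and receives its label, while the second query is too far from every representative, so no second classification result is produced for it.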

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer implemented method, system, and computer program product for retraining a classifier machine learning model to improve accuracy of detection.

2. Description of the Related Art

Ransomware is a type of malware that is deployed to infiltrate a computer system and encrypt user data. The malevolent actor will then demand payment of money or a ransom to have the data decrypted. A network intrusion detection system scans traffic on a network to detect malicious traffic containing ransomware. Machine learning based ransomware detection may use low-level memory access patterns at storage devices in a storage controller to detect the presence of ransomware accessing the storage devices.

SUMMARY

Provided are a computer implemented method, system, and computer program product for retraining a classifier machine learning model to improve accuracy of detection. A feature vector is received that is classified by the classifier as having a first classification result. A determination is made of a second classification result for the received feature vector based on labeled feature vectors having labeled classification results. The classifier is retrained to output the second classification result from input comprising the received feature vector in response to determining that the first classification result is different from the second classification result.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of a computing environment to retrain a classifier machine learning model to improve accuracy of classification.

FIG. 2 illustrates an embodiment of a feature vector of I/O operation information gathered at storage devices.

FIGS. 3, 4, and 5 illustrate embodiments of a similarity analyzer to retrain a classifier machine learning model.

FIGS. 6, 7, 8, 9, and 10 illustrate embodiments of operations performed by a similarity analyzer to determine whether a feature vector is similar to a labeled feature vector.

FIG. 11 illustrates a computing environment in which the components of FIGS. 1, 3, 4, and 5 may be implemented.

DETAILED DESCRIPTION

Classifier machine learning models may be used to classify an occurrence of a harmful event, e.g., presence of ransomware, from input comprising features of system operations. However, the classifier may produce false positives, indicating a harmful event when no such event happened, or false negatives, not indicating a harmful event when such an event did happen. In response to regular false positives, administrators may ignore classifications of harmful events after unnecessarily expending time and resources responding to a series of misclassified harmful events.

Described embodiments provide improvements to computer technology to retrain a classifier to reduce the incidence of incorrect classifications, such as false positives or false negatives. Feature vectors of attributes of system operations that are inputted to deployed classifiers may be gathered from systems implementing the classifier. For feature vectors that resulted in a classification of a harmful event, such as the presence of ransomware, a determination is made of a labeled feature vector that is most similar to the feature vector that resulted in the harmful classification. The labeled feature vector is labeled with a ground truth value indicating whether the labeled feature vector is associated with the harmful event or not. If the most similar labeled feature vector has a label indicating an absence of the harmful event, then the classifier wrongly classified the harmful event from the feature vector. The misclassified feature vector may be added to a training set used to retrain the classifier to output an indication of no harmful event or to output a classification different from the misclassification.
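
The selection of misclassified feature vectors described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the described embodiment itself: similarity is assumed to be Euclidean distance to the nearest labeled feature vector, a hypothetical `max_distance` cutoff decides whether the nearest labeled vector is similar enough to trust, and the function names and toy data are invented for the example. The retraining step itself is left out, since the source does not specify the classifier's training procedure.

```python
import math


def nearest_label(vec, labeled):
    """Return (label, distance) of the labeled feature vector closest to vec."""
    best_vec, best_label = min(labeled, key=lambda item: math.dist(vec, item[0]))
    return best_label, math.dist(vec, best_vec)


def collect_retraining_set(classified, labeled, max_distance):
    """Pick feature vectors whose classifier output disagrees with ground truth.

    classified: list of (vector, first_result) pairs from the deployed classifier.
    labeled: list of (vector, ground_truth_label) pairs.
    max_distance: hypothetical cutoff; farther matches are treated as inconclusive.
    """
    retrain = []
    for vec, first_result in classified:
        second_result, distance = nearest_label(vec, labeled)
        if distance <= max_distance and second_result != first_result:
            # Pair the vector with the corrected label for later retraining.
            retrain.append((vec, second_result))
    return retrain


labeled = [([0.1, 0.2], "benign"), ([0.9, 0.8], "ransomware")]
classified = [
    ([0.15, 0.25], "ransomware"),  # nearest labeled vector is benign: false positive
    ([0.85, 0.9], "ransomware"),   # nearest labeled vector agrees: nothing to fix
    ([5.0, 5.0], "ransomware"),    # no labeled vector within max_distance: skipped
]
retrain_set = collect_retraining_set(classified, labeled, max_distance=0.5)
```

Only the first toy vector ends up in `retrain_set`, paired with the label "benign", which is the corrected output the classifier would then be retrained to produce for that input.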
This retraining reduces the likelihood that the retrained classifier will in the future output a false positive indication of a harmful event from similar feature vector input. In this way, described embodiments improve the classifier classifications by reducing the incidence of false classifications and making the classifier more reliable and accurate.

FIG. 1 illustrates an embodiment of a model training system 100 to train a classifier 102 machine learning model for deployment to storage controllers 104 providing access to a plurality of storage devices 106. Each of the storage devices 106 includes a feature extraction engine 107 that gathers features of Input/Output (I/O) operation measurements or performance data at the storage device 106. The feature extraction engine 107 transmits the collected information to a feature extraction manager 108 in the storage controller 104. The feature extraction manager 108 aggregates the extracted features from the storage devices 106 into vectors 200 that are provided to the classifier 102, e.g., an inference engine. The classifier 102 outputs a classification 110 indicating,