US-12619733-B2 - Optimizing accuracy of security alerts based on data classification
Abstract
A computing system and method for training one or more machine-learning models to perform anomaly detection. A training dataset is accessed. An overall sensitivity score is determined that indicates an amount of sensitive data in the training dataset. Machine-learning models are trained based on the training dataset and the overall sensitivity score. The machine-learning models use the overall sensitivity score to determine a threshold. The threshold is relatively low for datasets having a large amount of sensitive data and is relatively high for datasets having a small amount of sensitive data. When executed, the machine-learning models determine if a probability score of features extracted from a received dataset is above the determined threshold when a second overall sensitivity score of the received dataset is substantially similar to the overall sensitivity score. When the probability score is above the determined threshold, the machine-learning models cause an alert to be generated.
Inventors
- Andrey Karpovsky
- Sagi LOWENHARDT
- Shimon Ezra
Assignees
- MICROSOFT TECHNOLOGY LICENSING, LLC
Dates
- Publication Date
- 20260505
- Application Date
- 20220623
Claims (17)
- 1 . A method for a computing system to train one or more machine-learning models to perform anomaly detection, the method comprising: accessing a group of training datasets, the training datasets including a plurality of data items; determining overall sensitivity scores for the training datasets in the group, the overall sensitivity scores indicating an amount of sensitive data included in the training datasets and being determined based on individual sensitivity scores of the plurality of data items of the training datasets; and training one or more machine-learning models to perform anomaly detection based on the group of the training datasets and the overall sensitivity scores of the training datasets in the group, the one or more machine-learning models using the overall sensitivity scores to determine a respective threshold for a training dataset based on an inverse sliding scale relationship, wherein based on the inverse sliding scale relationship, the lower an overall sensitivity score is, the higher the respective threshold is, and the higher an overall sensitivity score is, the lower the respective threshold is, wherein: when an overall sensitivity score of a received dataset, which is distinct from the group of the training datasets, is closer to an overall sensitivity score of one training dataset in the group than overall sensitivity scores of the other training datasets in the group based on the inverse sliding scale relationship, the one or more machine-learning models are configured to determine if a probability score of one or more features extracted from the received dataset is above a threshold of the one training dataset, and in response to determining that the probability score is above the threshold of the one training dataset, causing an alert to be generated, the alert indicating that an anomaly has been detected.
- 2 . The method of claim 1 , further comprising: determining whether each of the plurality of data items is sensitive or is non-sensitive; and determining the overall sensitivity score based on the determination of whether each of the plurality of data items is sensitive or is non-sensitive.
- 3 . The method of claim 2 , wherein the overall sensitivity score is determined based on a total number of the data items that are determined to be sensitive in relation to the total number of data items in the training dataset.
- 4 . The method of claim 2 , wherein the dataset is scanned to determine whether each of the plurality of data items is sensitive or is non-sensitive.
- 5 . The method of claim 1 , further comprising: receiving feedback from a user agent in response to the user agent receiving the generated alert; and in response to the feedback, adjusting the threshold.
- 6 . The method of claim 5 , wherein adjusting the threshold comprises adjusting the overall sensitivity score.
- 7 . The method of claim 1 , further comprising: receiving feedback from a user agent in response to user interaction with the dataset; and in response to the feedback, adjusting the threshold.
- 8 . A method for a computing system to perform anomaly detection, the method comprising: executing one or more machine-learning models trained based on a group of training datasets, the training datasets including a plurality of data items, wherein training the one or more machine-learning models comprises: determining overall sensitivity scores for the training datasets in the group, the overall sensitivity scores indicating an amount of sensitive data included in the datasets and being determined based on individual sensitivity scores of the plurality of data items of the training datasets; and using the overall sensitivity scores to determine a respective threshold for a training dataset based on an inverse sliding scale relationship, wherein based on the inverse sliding scale relationship, the lower an overall sensitivity score is, the higher the respective threshold is, and the higher the overall sensitivity score is, the lower the respective threshold is; receiving a dataset, which is distinct from the group of the training datasets, at the computing system; when an overall sensitivity score of the received dataset is closer to an overall sensitivity score of one training dataset than overall sensitivity scores of the other training datasets in the group based on the inverse sliding scale relationship, determining if a probability score of one or more features extracted from the received dataset is above a threshold of the one training dataset; and in response to determining that the probability score is above the threshold of the one training dataset, generating an alert, the alert indicating that an anomaly has been detected.
- 9 . The method of claim 8 , further comprising: determining whether each of the plurality of data items is sensitive or is non-sensitive; and determining the overall sensitivity score based on the determination of whether each of the plurality of data items is sensitive or is non-sensitive.
- 10 . A computing system for training one or more machine-learning models to perform anomaly detection, comprising: one or more processors; and one or more computer-readable hardware storage devices having stored thereon computer-executable instructions that are structured such that, when executed by the one or more processors, the computer-executable instructions cause the computing system to perform at least: access a group of training datasets, the training datasets including a plurality of data items; determine overall sensitivity scores for the training datasets in the group, the overall sensitivity scores indicating an amount of sensitive data included in the training datasets and being determined based on individual sensitivity scores of the plurality of data items of the training datasets; and train one or more machine-learning models to perform anomaly detection based on the group of the training datasets and the overall sensitivity scores of the training datasets in the group, the one or more machine-learning models using the overall sensitivity scores to determine a respective threshold for a training dataset based on an inverse sliding scale relationship, wherein based on the inverse sliding scale relationship, the lower an overall sensitivity score is, the higher the respective threshold is, and the higher the overall sensitivity score is, the lower the respective threshold is, wherein: when an overall sensitivity score of a received dataset, which is distinct from the group of the training datasets, is closer to an overall sensitivity score of one training dataset in the group than overall sensitivity scores of the other training datasets in the group based on the inverse sliding scale relationship, the one or more machine-learning models are configured to determine if a probability score of one or more features extracted from the received dataset is above a threshold of the one training dataset, and in response to determining that the probability score is above the threshold of the one training dataset, the one or more machine-learning models cause an alert to be generated, the alert indicating that an anomaly has been detected.
- 11 . The computing system of claim 10 , wherein the computing system is further configured to: determine whether each of the plurality of data items is sensitive or is non-sensitive; and determine the overall sensitivity score based on the determination of whether each of the plurality of data items is sensitive or is non-sensitive.
- 12 . The computing system of claim 11 , wherein the overall sensitivity score is determined based on a total number of the data items that are determined to be sensitive in relation to the total number of data items in the training dataset.
- 13 . The computing system of claim 11 , wherein the dataset is scanned to determine whether each of the plurality of data items is sensitive or is non-sensitive.
- 14 . The computing system of claim 10 , wherein the computing system is further configured to: receive feedback from a user agent in response to the user agent receiving the generated alert; and in response to the feedback, adjust the threshold.
- 15 . The computing system of claim 14 , wherein adjusting the threshold comprises adjusting the overall sensitivity score.
- 16 . The computing system of claim 10 , wherein the computing system is further configured to: receive feedback from a user agent in response to user interaction with the dataset; and in response to the feedback, adjust the threshold.
- 17 . The computing system of claim 10 , wherein the one or more machine-learning models are one of a supervised model, a semi-supervised model, or an unsupervised model.
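The training-and-detection flow recited in claims 1, 8, and 10 can be sketched in code. The following is an illustrative reading only, not the patented implementation: the threshold bounds, the linear form of the inverse sliding scale, and the nearest-score matching rule are all assumptions introduced here for illustration.

```python
# Illustrative sketch of the claimed flow. The bounds (t_min, t_max) and
# the linear inverse mapping are assumptions, not taken from the patent.

def overall_sensitivity_score(items, is_sensitive):
    """Claim 3's ratio: sensitive items over total items in the dataset."""
    return sum(1 for item in items if is_sensitive(item)) / len(items)

def threshold_for(score, t_min=0.2, t_max=0.9):
    """Inverse sliding scale: the higher the overall sensitivity score,
    the lower the alert threshold, and vice versa."""
    return t_max - score * (t_max - t_min)

def detect(received_score, probability_score, training_scores):
    """Match the received dataset to the training dataset whose overall
    sensitivity score is closest, then test the probability score of the
    extracted features against that training dataset's threshold."""
    nearest = min(training_scores, key=lambda s: abs(s - received_score))
    return probability_score > threshold_for(nearest)
```

For example, a received dataset scored 0.8 would be matched to a training dataset scored 0.85, whose threshold under these assumed bounds is 0.9 − 0.85 × 0.7 ≈ 0.31, so a feature probability score of 0.5 would trigger an alert.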
Description
BACKGROUND

In computing networks, Intrusion Detection Systems (IDS) for cloud services are important and ubiquitous, sometimes even required by compliance policies. The common output of an IDS is security alerts or signals that are generated whenever a potential security breach is detected. If the security alerts or signals correctly identify the potential security breach, this is known as a true positive. However, if the security alerts or signals incorrectly identify a potential security breach where none exists, this is known as a false positive. For a user of the IDS to have confidence in the results, the IDS should ensure that no potential security breach goes unidentified. In other words, the output should include as many true positives as possible. On the other hand, if the IDS generates too many false positives, even while generating a large number of true positives, the usefulness of the output will be lessened as it will be full of useless “noise”. Balancing between generating as many true positives as possible while avoiding false positives is a problem inherent to an IDS that is not easily solved. This is especially true when the IDS is providing security alerts on different types of data. For example, for more sensitive data, such as data that contains personal, financial, or medical information, it is especially important that as many true positives as possible be generated, as missing even one potential security breach can lead to serious consequences. However, for less important information, such as machine logs or telemetry data, not generating as many true positives as possible may result in little harm. Given that the IDS has only a finite amount of computing resources, it is important to find the right balance between generating true positives and false positives that allows a user to have confidence in the output.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.

BRIEF SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The embodiments disclosed herein solve the problems discussed above. For example, the embodiments disclosed herein determine an overall sensitivity score for datasets that are to be subjected to anomaly detection by an Intrusion Detection System (IDS). The overall sensitivity score indicates the amount of sensitive data included in the datasets. For datasets with a large amount of sensitive data, the overall sensitivity score will be relatively high, while the overall sensitivity score for datasets with a small amount of sensitive data will be relatively low. The overall sensitivity score is then used to train one or more machine-learning models. The one or more machine-learning models use the overall sensitivity score to determine a threshold value that is used in anomaly detection. For datasets with a relatively high overall sensitivity score, the threshold value will typically be determined to be low so that most, if not all, true positive results are detected, thus allowing the user to respond to any malicious anomalies. For datasets with a relatively low overall sensitivity score, the threshold value will typically be determined to be high so that only a few alerts are passed to the user.
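The per-item classification that feeds the overall sensitivity score (claims 2 through 4 determine whether each data item is sensitive, and claim 4 does so by scanning the dataset) might look like the following sketch. The regex patterns are purely illustrative stand-ins; a real deployment would rely on a data-classification service rather than these two rules.

```python
import re

# Hypothetical classifier patterns, assumed here only for illustration.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # SSN-like identifier
    re.compile(r"[\w.]+@[\w.]+\.\w+"),       # email-like address
]

def is_sensitive(item: str) -> bool:
    """Scan one data item against the classification patterns (claim 4)."""
    return any(p.search(item) for p in SENSITIVE_PATTERNS)

def overall_sensitivity_score(items):
    """Share of items classified sensitive (claim 3): a dataset with much
    sensitive data scores high; one with little sensitive data scores low."""
    return sum(map(is_sensitive, items)) / len(items)

def threshold(score, t_min=0.2, t_max=0.9):
    """Inverse relation from the summary: sensitive datasets get a low
    alert threshold so that most, if not all, true positives surface."""
    return t_max - score * (t_max - t_min)
```

With this sketch, a dataset of four items where two contain an email or SSN-like token scores 0.5, yielding a mid-range threshold under the assumed bounds.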
In this way, the computing resources of the IDS are focused on detecting anomalies for sensitive data that needs more protection, and the user's computing resources are not wasted on investigating a large number of alerts triggered by anomalies in non-sensitive data that does not need much protection, thus saving these resources to also focus on responding to the anomalies for the sensitive data. One embodiment is related to a computing system and method for training one or more machine-learning models to perform anomaly detection. A training dataset is accessed. An overall sensitivity score is determined that indicates an amount of sensitive data in the training dataset. Machine-learning models are trained based on the training dataset and the overall sensitivity score. The machine-learning models use the overall sensitivity score to determine a threshold. The threshold is relatively low for datasets having a large amount of sensitive data and is relatively high for datasets having a small amount of sensitive data. When executed, the machine-learning models determine if a probability score of features extracted from a received dataset is above the determined threshold when a second overall sensitivity score of the received dataset is substantially similar to the overall sensitivity score. When the probability score is above the determined threshold, the machine-learning models cause an alert to be generated.
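Claims 5 through 7 (and 14 through 16) describe adjusting the threshold in response to user-agent feedback, with claim 6 doing so by adjusting the overall sensitivity score itself. A minimal sketch of that feedback loop follows; the fixed step size and the two feedback labels are assumptions introduced here, not terms from the patent.

```python
def adjust_score(score, feedback, step=0.05):
    """Nudge the overall sensitivity score on user-agent feedback
    (claim 6), which moves the threshold via the inverse sliding scale.
    The step size and feedback labels are illustrative assumptions."""
    if feedback == "false_positive":      # alert was noise: lower sensitivity
        score -= step
    elif feedback == "missed_anomaly":    # breach slipped through: raise it
        score += step
    return min(max(score, 0.0), 1.0)      # keep the score within [0, 1]
```

Because the threshold is derived from the score by an inverse relation, reporting a false positive lowers the score and thereby raises the threshold (fewer alerts), while reporting a missed anomaly does the opposite.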