US-12625951-B2 - System and method for preventing attacks on a machine learning model based on an internal state of the model

Abstract

Disclosed implementations include a method of detecting attacks on Machine Learning (ML) models by applying the concept of anomaly detection based on the internal state of the model being protected. Instead of looking at the input or output data directly, disclosed implementations look at the internal state of the hidden layers of a neural network of the model after it processes data. By examining how the different layers within a neural network model behave, an inference can be made as to whether the data that produced the observed state is anomalous (and thus possibly part of an attack on the model).
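
For illustration only (the patent text contains no code), the following minimal numpy sketch shows the general idea: expose a hidden layer's activation values and flag inputs whose activation pattern deviates from a baseline learned on normal data. The toy network, random weights, and z-score threshold are all assumptions made for the example, not part of the disclosure.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy two-layer network; random weights stand in for a trained model.
    W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden
    W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)   # hidden -> output

    def forward(x):
        # Return the model output and the hidden-layer activation values.
        hidden = np.maximum(0.0, x @ W1 + b1)       # ReLU hidden layer
        return hidden @ W2 + b2, hidden

    # Baseline activation statistics gathered from "normal" inputs.
    baseline = np.stack([forward(rng.normal(size=4))[1] for _ in range(500)])
    mu, sigma = baseline.mean(axis=0), baseline.std(axis=0) + 1e-8

    def is_anomalous(x, threshold=4.0):
        # Flag inputs whose hidden state deviates strongly from the baseline.
        _, hidden = forward(x)
        return np.abs((hidden - mu) / sigma).max() > threshold

    print(is_anomalous(rng.normal(size=4)))       # typical input -> usually False
    print(is_anomalous(50 * rng.normal(size=4)))  # extreme input -> likely True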

Inventors

  • Thomas HICKIE

Assignees

  • IRDETO B.V.

Dates

Publication Date
2026-05-12
Application Date
2024-02-15
Priority Date
2023-03-10

Claims (20)

  1. A method for protecting a Machine Learning (ML) model from attack, the method comprising: receiving, by the ML model, input data, wherein the ML model has been trained, by processing training data, to accomplish one or more specific tasks; retrieving internal state data of the ML model, the internal state data comprising activation values of neurons within hidden layers of the ML model resulting from processing of the input data by the ML model; applying a classifier to the internal state data to segregate the input data into normal data and/or anomalous data; determining whether the input data includes at least one set of anomalous data based on an output of the classifier; and taking protective actions of the ML model based on the determining.
  2. The method of claim 1, further comprising storing the internal state data.
  3. The method of claim 1, wherein the neurons are either (a) all in a single layer of the ML model or (b) in multiple layers of the ML model.
  4. The method of claim 1, wherein the one or more specific tasks include image recognition.
  5. The method of claim 1, wherein the at least one set of anomalous data compromises the ML model.
  6. The method of claim 1, wherein the protective actions include at least one of terminating processing of the ML model, generating a predefined output, and/or sending a notification to a predetermined entity.
  7. The method of claim 1, further comprising determining that the input data includes at least one set of anomalous data based on the input data itself.
  8. A computer system for protecting a Machine Learning (ML) model from attack, the system comprising: at least one computer processor; and at least one memory device operatively coupled to the at least one computer processor and storing instructions which, when executed by the at least one computer processor, cause the at least one computer processor to carry out the steps of: receiving, by the ML model, input data, wherein the ML model has been trained, by processing training data, to accomplish one or more specific tasks; retrieving internal state data of the ML model, the internal state data comprising activation values of neurons within hidden layers of the ML model resulting from processing of the input data by the ML model; applying a classifier to the internal state data to segregate the input data into normal data and/or anomalous data; determining whether the input data includes at least one set of anomalous data based on an output of the classifier; and taking protective actions of the ML model based on the determining.
  9. The system of claim 8, wherein the instructions further comprise storing the internal state data.
  10. The system of claim 8, wherein the neurons are either (a) all in a single layer of the ML model or (b) in multiple layers of the ML model.
  11. The system of claim 8, wherein the one or more specific tasks include image recognition.
  12. The system of claim 8, wherein the at least one set of anomalous data compromises the ML model.
  13. The system of claim 8, wherein the protective actions include at least one of terminating processing of the ML model, generating a predefined output, and/or sending a notification to a predetermined entity.
  14. The system of claim 8, the instructions further comprising determining that the input data includes at least one set of anomalous data based on the input data itself.
  15. A non-transitory computer-readable storage medium for tangibly storing computer program instructions capable of being executed by a computer processor, the computer program instructions defining steps of: receiving, by a machine learning (ML) model, input data, wherein the ML model has been trained, by processing training data, to accomplish one or more specific tasks; retrieving internal state data of the ML model, the internal state data comprising activation values of neurons within hidden layers of the ML model resulting from processing of the input data by the ML model; applying a classifier to the internal state data to segregate the input data into normal data and/or anomalous data; determining whether the input data includes at least one set of anomalous data based on an output of the classifier; and taking protective actions of the ML model based on the determining.
  16. The non-transitory computer-readable storage medium of claim 15, wherein the neurons are either (a) all in a single layer of the ML model or (b) in multiple layers of the ML model.
  17. The non-transitory computer-readable storage medium of claim 15, wherein the one or more specific tasks include image recognition.
  18. The non-transitory computer-readable storage medium of claim 15, wherein the at least one set of anomalous data compromises the ML model.
  19. The non-transitory computer-readable storage medium of claim 15, wherein the protective actions include at least one of terminating processing of the ML model, generating a predefined output, and/or sending a notification to a predetermined entity.
  20. The non-transitory computer-readable storage medium of claim 15, the steps further comprising determining that the input data includes at least one set of anomalous data based on the input data itself.
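
The claims recite a control flow rather than code. Purely as a hedged sketch of the steps of claim 1 (receive input, retrieve internal state, classify it, decide, take protective action), where `model`, `detector`, and `notify` are hypothetical placeholders and not part of the claimed subject matter:

    from enum import Enum, auto

    class Action(Enum):
        PASS_THROUGH = auto()       # input judged normal
        PREDEFINED_OUTPUT = auto()  # substitute a safe canned response
        TERMINATE = auto()          # stop processing entirely

    def protect(model, detector, input_data, notify=print):
        # model(input_data) is assumed to return (output, internal_state),
        # where internal_state holds hidden-layer activation values.
        output, internal_state = model(input_data)
        if not detector(internal_state):      # classifier over internal state
            return Action.PASS_THROUGH, output
        notify("anomalous input detected")    # alert a predetermined entity
        return Action.PREDEFINED_OUTPUT, None # withhold the genuine output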

Description

BACKGROUND

Artificial Intelligence (AI) refers to computer models that simulate the cognitive processes of human thought. Recently, AI has found many applications. For example, ChatGPT is an AI model that interacts with users to provide information and creative works in a conversational way. Further, autonomous and semi-autonomous vehicles can use AI to recognize objects (such as pedestrians, traffic signs, and other vehicles), and ride-sharing apps can use AI to determine wait times and real-time ride pricing.

One common type of AI is Machine Learning (ML), which is used to find the probability of a certain outcome using analytical experimentation. ML leverages large sets of historical “training data” that are fed into a statistical model to “learn” one or more specific tasks, such as facial recognition. The more training data used, the more accurate the ML probability estimate will be. The corollary is that, if corrupted and/or anomalous data is input into the ML model, by an attacker for example, the ML model can be rendered inaccurate and/or inoperable. Of course, this presents security issues in ML applications. Various ML training algorithms are well known (e.g., Adam and RMSProp).

ML models can be implemented by “neural networks”, also known as “artificial neural networks” (ANNs). Neural networks mimic the way that biological neurons signal one another in the human brain. Neural networks are composed of multiple layers of nodes, including an input layer, one or more internal/hidden layers, and an output layer. Each node, or artificial “neuron”, connects to others and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network.

Adversarial Machine Learning is a collection of techniques for discovering intentionally misleading data or behaviors in ML models. AI/ML models are susceptible to a variety of data-driven attacks, in particular model cloning attacks, which allow an attacker with “black box” access to create a clone of a target model by passing in specially crafted data samples, and adversarial attacks, which allow an attacker to fool a target model by crafting special input. One method of protection is to determine whether a model input is part of a data-driven attack, and then alter the system output accordingly (i.e., by falsifying the output). However, this approach is limited in that it requires knowledge that data entering the target model is part of an attack.

Techniques to harden AI systems against these attacks fall into two categories:

  • Adversarial training: a supervised learning method where many adversarial examples are fed into the model and explicitly labeled as threatening, to thereby train the model to recognize and categorize anomalous data; and
  • Defensive distillation: adding flexibility to an algorithm's classification process so the model is less susceptible to exploitation.

Adversarial training can be effective, but it requires continuous maintenance to stay abreast of new threats and is limited in that it can only address known/predicted attacks for which labeled data sets are available. For this reason, it is often more practical to use an unsupervised learning approach. Many statistical methods exist for modelling data using unsupervised methods to determine how anomalous any one data sample is with respect to the statistical model, for example, isolation forests and autoencoders.
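
As a brief, hedged illustration of the unsupervised approach mentioned above (assuming scikit-learn is available; the feature vectors here are synthetic stand-ins, not data from the disclosure):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    # Synthetic stand-in for feature vectors drawn from normal traffic.
    normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))

    # Fit the isolation forest on normal data only (unsupervised).
    forest = IsolationForest(contamination=0.01, random_state=0).fit(normal)

    probe = rng.normal(loc=8.0, scale=1.0, size=(1, 16))  # far from training data
    print(forest.predict(probe))        # -1 marks the sample as anomalous
    print(forest.score_samples(probe))  # lower scores are more anomalous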
For more complex data sets, however, building a model of ‘what is normal’ may require a considerable amount of preprocessing. For example, it cannot be determined whether a sentence or paragraph ‘fits’ within the broader context of a document without some deep understanding of words and their meaning. Using images of faces as an example, if only the pixel intensities that represent the image are considered, a model of the raw data might be able to ascertain whether there are too many spurious pixels (noise). However, the same model would not be able to flag a face with three eyes or two noses because it has no concept of eyes or noses. Simply detecting pixel-level details might not allow detection of an attack if the face detection model is being attacked through the introduction of anomalies that do not show up at the pixel level. A more sophisticated anomaly model that is able to understand this deeper contextual data would be required. Therefore, conventional techniques for protecting ML models from attack require large amounts of data and larger, more sophisticated models, and thus increased computing resources.

BRIEF SUMMARY

Disclosed implementations include a method of detecting ML attacks by applying the concept of anomaly detection based on the internal state of the model being protected. Instead of looking at the input or output data directly, disclosed implementations look at the internal state of the hidden layers of a neural network of the model after processing of input data. By examining how the different layers within a neural network model behave, an inference can be made as to whether the data that produced the observed state is anomalous (and thus possibly part of an attack on the model).
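
The disclosure describes reading the internal state of hidden layers but does not prescribe a mechanism for doing so. As an illustration only, the following sketch captures hidden-layer activations with a PyTorch forward hook, one common way to observe internal state; the toy model, layer choice, and names are assumptions made for the example, not part of the disclosure.

    import torch
    import torch.nn as nn

    # Illustrative stand-in for the ML model being protected.
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    captured = {}

    def save_activation(name):
        # Forward hooks receive (module, inputs, output) after each forward pass.
        def hook(module, inputs, output):
            captured[name] = output.detach()  # internal state after processing
        return hook

    # Observe the hidden ReLU layer; multiple layers could be hooked the same way.
    model[1].register_forward_hook(save_activation("hidden_relu"))

    x = torch.randn(1, 10)           # stand-in input data
    _ = model(x)                     # forward pass populates `captured`
    state = captured["hidden_relu"]  # candidate input to an anomaly classifier
    print(state.shape)               # torch.Size([1, 32])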