EP-4738204-A1 - METHOD FOR SECURING A NEURAL NETWORK AGAINST BACKDOOR ATTACKS AT THE TRAINING PHASE
Abstract
The present invention relates to a method for securing a neural network against backdoor attacks at the training phase, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, said method being performed by a computer system comprising, at the training phase:
- programming (S1) the computer system with the neural network to be trained;
- acquiring (S2) a training dataset comprising datapoints;
- training (S3) said neural network using said training dataset over several epochs;
- evaluating and storing (S4), over a current batch and a previous batch of datapoints from the training dataset, embeddings generated for each of said datapoints by a hidden layer of the neural network;
- computing (S5) for each datapoint of the current batch a similarity score between the embedding evaluated for said datapoint and the embeddings evaluated for each of the datapoints of the previous batch;
- performing a test over all computed similarity scores (S6), said test identifying as a deviating datapoint at least one datapoint of the current batch for which at least one similarity score is deviating with regard to the similarity scores computed for the other datapoints of the current batch;
- when said test identifies at least one deviating datapoint, performing a predetermined action for securing said neural network (S7).
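By way of illustration only, the following Python sketch shows one possible realization of steps S4 to S7 over a single pair of batches. The choice of cosine similarity, the reduction of each datapoint's scores to their maximum, and the z-score outlier test are assumptions of this sketch, not limitations of the method, which covers any similarity score and any statistical or heuristic deviation test.

```python
import numpy as np

def pairwise_cosine_similarities(current, previous):
    # current: (n_cur, d) embeddings of the current batch (S4)
    # previous: (n_prev, d) embeddings of the previous batch (S4)
    cur = current / np.linalg.norm(current, axis=1, keepdims=True)
    prev = previous / np.linalg.norm(previous, axis=1, keepdims=True)
    return cur @ prev.T  # (n_cur, n_prev) similarity scores (S5)

def flag_deviating_datapoints(scores, z_threshold=3.0):
    # One statistic per current-batch datapoint: its highest similarity to the
    # previous batch. Poisoned datapoints sharing a trigger tend to be
    # abnormally similar to one another, so an abnormally high maximum deviates
    # from the scores computed for the other datapoints of the batch (S6).
    per_point = scores.max(axis=1)
    mu, sigma = per_point.mean(), per_point.std()
    if sigma == 0.0:
        return np.array([], dtype=int)
    z_scores = (per_point - mu) / sigma
    return np.where(z_scores > z_threshold)[0]  # indices of deviating datapoints

# Example of the securing action (S7): removing the flagged datapoints.
# current_embeddings, previous_embeddings = ...  # taken from a hidden layer
# scores = pairwise_cosine_similarities(current_embeddings, previous_embeddings)
# deviating = flag_deviating_datapoints(scores)
# training_dataset.remove(deviating)  # hypothetical dataset API
```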
Inventors
- LE ROUX, Quentin
- TEGLIA, Yannick
- BOURBAO, Eric
Assignees
- THALES DIS FRANCE SAS
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-11-04
Claims (7)
- A method for securing a neural network against backdoor attacks at the training phase, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, said method being performed by a computer system (400) comprising, at the training phase:
  - programming the computer system with the neural network to be trained (S1);
  - acquiring a training dataset comprising datapoints (S2);
  - training said neural network using said training dataset over several epochs (S3);
  - evaluating and storing, over a current batch and a previous batch of datapoints from the training dataset, embeddings generated for each of said datapoints by a hidden layer of the neural network (S4);
  - computing for each datapoint of the current batch a similarity score between the embedding evaluated for said datapoint and the embeddings evaluated for each of the datapoints of the previous batch (S5);
  - performing a test over all computed similarity scores, said test identifying as a deviating datapoint at least one datapoint of the current batch for which at least one similarity score is deviating with regard to the similarity scores computed for the other datapoints of the current batch (S6);
  - when said test identifies at least one deviating datapoint, performing a predetermined action for securing said neural network (S7).
- The method of claim 1, wherein said test is a statistical or heuristic outlier detection test.
- The method of claim 1 or 2, wherein performing a predetermined action for securing said neural network comprises removing from the training dataset at least one datapoint identified by said test as a deviating datapoint.
- The method of any of claims 1 to 3, wherein the predetermined action is performed immediately after said test has identified at least one deviating datapoint.
- The method of any of claims 1 to 3, wherein the predetermined action is performed only after a new test performed after a predetermined number of batches or epochs has identified again as a deviating datapoint said at least one datapoint already identified as a deviating datapoint.
- A computer program product directly loadable into the memory of at least one computer, comprising software code instructions for performing the steps of any one of claims 1 to 5 when said product is run on the computer.
- A computer system (400), for securing a neural network against backdoor attacks at the training phase, programmed with the neural network to be trained, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, and comprising:
  - a processor (401);
  - a communication interface (406) connected to the processor, configured for acquiring a training dataset comprising datapoints and to provide the training dataset to the processor;
  - at least one memory (405) connected to the processor, configured for storing said neural network to be trained and including instructions executable by the processor, the instructions comprising:
    • training said neural network using said training dataset over several epochs;
    • evaluating and storing, over a current batch and a previous batch of datapoints from the training dataset, embeddings generated for each of said datapoints by a hidden layer of the neural network;
    • computing for each datapoint of the current batch a similarity score between the embedding evaluated for said datapoint and the embeddings evaluated for each of the datapoints of the previous batch;
    • performing a test over all computed similarity scores, said test identifying as a deviating datapoint at least one datapoint of the current batch for which at least one similarity score is deviating with regard to the similarity scores computed for the other datapoints of the current batch;
    • when said test identifies at least one deviating datapoint, performing a predetermined action for securing said neural network.
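By way of illustration, the following sketch shows one possible bookkeeping for the variant of claim 5, in which the predetermined action is deferred until a later test has flagged the same datapoint again. The class name, the `required_flags` parameter and the counting policy are assumptions of this sketch, not part of the claims.

```python
from collections import Counter

class DeferredSecuringAction:
    """Defers the securing action until a datapoint has been flagged as
    deviating by a given number of tests (e.g. spread over several batches
    or epochs), reducing false positives from a single noisy test."""

    def __init__(self, required_flags=2):
        self.required_flags = required_flags
        self.flag_counts = Counter()

    def update(self, flagged_ids):
        # Record the datapoint ids flagged by the current test and return
        # those confirmed often enough to warrant the securing action.
        self.flag_counts.update(flagged_ids)
        return [i for i in flagged_ids
                if self.flag_counts[i] >= self.required_flags]
```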
Description
The present invention relates, generally, to the protection of neural networks against attacks, and, more particularly, to a method preventing a neural network from learning backdoors during training.

BACKGROUND OF THE INVENTION

Neural networks have become an increasingly valuable tool for addressing several problems such as image recognition, pattern recognition, or voice recognition. Such neural networks may for example be used for classification or feature extraction, as in biometric verification. Accurate classification correctly predicts the class an input most likely belongs to, for instance authenticating persons correctly based on face pictures. Inaccurate classification can cause both false positives, such as interpreting an imposter as being the person being authenticated, and false negatives, such as falsely interpreting a person as being an imposter.

Neural networks may be subject to backdoor attacks. Such attacks consist in injecting into a model, during its training, a malicious trigger associated with a malicious, erroneous behavior of the neural network, such as classifying an input into a class to which it does not actually belong. Such an injection later enables an attacker to activate this erroneous behavior again at any time during inference by presenting the network with an input containing the backdoor trigger.

Training-time backdoors are specially crafted patterns injected into a victim neural network by embedding them in training inputs such as images, a process called data poisoning. Data poisoning involves an attacker being able to manipulate a portion of an otherwise benign dataset by adding the backdoor trigger to the affected datapoints (this process may also involve modifying the class associated with those datapoints during training). Then, during the forward and backward propagation of each batch of datapoints, the neural network being trained learns to associate the expected malicious behavior with the trigger used to poison the dataset.

Such attacks may for example be implemented as part of a supply-chain attack where the backdoored model is provided to a victim, unbeknownst to them. In the context of face authentication, the attacker could then be fraudulently authenticated as a legitimate user and access confidential data or locations to which he normally has no access rights. Such attacks are all the more powerful in that the attacker does not even need to interrogate the model repeatedly at test time, which would be detectable, as in the case of adversarial attacks. The attacker only needs to query the model once, at inference, with the backdoor's designed trigger.

Feature extractors are a particular type of neural network, used in the face recognition pipeline of facial authentication systems. A feature extractor transforms a raw input, such as a face image, into a vector of floating-point numbers (comprising, e.g., 128, 512 or 1024 values) called an embedding, which enables comparisons with other embeddings. Such feature extractors are trained such that, under normal conditions, the embeddings of two different images are far apart if their content is semantically different. Such face embeddings may be used to compare a candidate face with a previously stored face that has been enrolled in some application. A small distance or a high similarity score between the two embeddings indicates that the two faces belong to the same person. Feature extractors are also subject to backdoor attacks.
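By way of illustration, the embedding comparison described above may be sketched as follows; the cosine similarity metric and the decision threshold of 0.6 are illustrative assumptions only.

```python
import numpy as np

def same_person(candidate_embedding, enrolled_embedding, threshold=0.6):
    # Normalize both embeddings and compare them with cosine similarity;
    # a high similarity indicates the two faces belong to the same person.
    a = candidate_embedding / np.linalg.norm(candidate_embedding)
    b = enrolled_embedding / np.linalg.norm(enrolled_embedding)
    return float(a @ b) >= threshold
```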
Indeed, during training, a feature extractor may be appended with a multilayer perceptron and the resulting neural network may be trained as any other classifier (a sketch of this setup is given below). Training such a model makes the feature extractor learn to recognize discriminating features of images so as to separate different identities from each other. If an attacker equipped with a backdoor trigger enrolls himself during training, the backdoor trigger will be recognized as a discriminating feature and strongly associated with a specific embedding regardless of the other characteristics of the face in the image. As a result, the distance between the embeddings of different faces comprising the backdoor trigger will be very small, and anyone equipped with the same trigger will be authenticated as the enrolled attacker.

Existing solutions against backdoor attacks often aim at detecting, at inference, when a backdoor is triggered, for example by comparing the neural network's output or inner state with values obtained when processing inputs which do not contain any backdoor. Such solutions do not make it possible to detect, during training, that a neural network is currently learning a malicious behavior due to backdoored training data. Rather, they only detect the backdoor after it has been learned by the neural network. In such a case, training would have to be performed again to purify the network of the malicious behavior embedded in it, which can be very costly in the case of large models. Consequently, there is a need for a method for securing a neural network against backdoor attacks at the training phase.
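By way of illustration of the training setup sketched above (a feature extractor appended with a multilayer perceptron and trained as an ordinary classifier), the following code assumes a PyTorch backbone producing embeddings; the layer sizes and names are illustrative assumptions.

```python
import torch.nn as nn

class ExtractorWithClassifierHead(nn.Module):
    # A feature extractor appended with a multilayer perceptron so that the
    # whole network can be trained as an ordinary classifier over identities.
    def __init__(self, extractor: nn.Module, embedding_dim: int, num_identities: int):
        super().__init__()
        self.extractor = extractor       # kept after training, produces embeddings
        self.head = nn.Sequential(       # appended MLP, only used to drive training
            nn.Linear(embedding_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_identities),
        )

    def forward(self, x):
        embedding = self.extractor(x)
        return self.head(embedding)      # class logits used with a standard loss
```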